Maximal Information Coefficient-Based Testing to Identify Epistasis in Case-Control Association Studies

Interactions between genetic variants (epistasis) are ubiquitous in the model system and can significantly affect evolutionary adaptation, genetic mapping, and precision medical efforts. In this paper, we proposed a method for epistasis detection, called EpiMIC (epistasis detection through a maximal information coefficient (MIC)). MIC is a promising bivariate dependence measure explicitly designed for rapidly exploring various function types equally and for interpreting and comparing them on the same scale. Most epistasis detection approaches make assumptions about the form of the association between genetic variants, resulting in limited statistical performance. Based on the notion that if two SNPs do not interact, their joint distribution in all samples and in only cases should not be substantially different. We developed a statistic that utilizes the difference of MIC as a signal of epistasis and combined it with a permutation resampling strategy to estimate the empirical distribution of our statistic. Results of simulation and real-world data set showed that EpiMIC outperformed previous approaches for identifying epistasis at varying degrees of heredity.


Introduction
Genome-wide association studies (GWAS) is an emerging research strategy for discovering associations between genetic variation (e.g., single nucleotide polymorphism (SNP)) and traits like human diseases. More than 71,000 SNPs have been confirmed to be related significantly to diseases [1][2][3] since the first GWAS study was published in Science in 2005 [4]. The majority of these markers, however, are common genetic variants with small effects. Even though the whole genome sequencing enables us to detect several rare variants with large effect, "missing heritability" for the complex disease remains extensive [5][6][7]. For instance, only 75% of the phenotypic variance of Alzheimer's disease has been explained by known variants [8]. One possible explanation for "missing heritability" is that complex diseases are poly-genic, with multiple genes, environmental variables, and interactions involved in their etiology [9,10]. Genetic interactions are thought to provide a potential answer to the problem of "missing heritability." The solution may be partial, but it may help develop novel gene pathway topologies [11].
Epistasis is generally detected in two ways: biologically and statistically. Bateson and Mendel [12] introduced the concept of biological epistasis, which evaluates the interdependence of genetic variants. It occurs when the effect of one allele on one genetic mutation is dependent on the presence or absence of another genetic mutation and subsequently suppresses or activates the expression of other genes. Several studies reported novel epistasis in diseases. For instance, interactions between SNPs have been associated with pulmonary tuberculosis [13,14], recurrent miscarriage [15], polycystic ovary syndrome [16], and many more [17][18][19][20]. These findings highlight the potential and significance of epistasis research.
Statistical epistasis, coined by Fisher [21], is defined as the deviation from additive effects of genetic mutations at separate loci in terms of their overall contribution to the model. Biological epistasis, on the other hand, refers to the physical interaction of two or more biological components. Studies in model organisms [22][23][24] have shown that epistasis found using computational and statistical approaches may be physiologically connected in these species. The presence of statistical epistasis, however, does not imply the presence of biological epistasis. Bridging the statistical and biological epistasis gap is a crucial step toward understanding the underlying genetic architecture of complex diseases.
Currently available approaches for detecting statistical epistasis can be divided into three groups depending on their strategy: exhaustive methods, search methods, and machine learning-based methods [25,26]. Exhaustive methods have evaluated the association of all SNP combinations with phenotypes. Wan et al. [27] proposed BOOST, a multistage exhaustive approach that uses bitwise storage technology to speed up logistic regression test calculation. Zhang et al. [28] developed TEAM that calculates contingency tables by introducing a minimum spanning tree structure to detect pairwise SNP interactions. Ritchie et al. [29] used multifactor dimensionality reduction (MDR) to identify epistasis, which reduced the multiple SNP combinations into one dimension with high risk and low risk. It is one of the widely used methods in this field, and many methods have been developed based on it [30][31][32][33].
Exhaustive methods can effectively avoid omitting epistasis detection, but it requires massive computation. Stochastic techniques and heuristic searches are examples of search algorithms. The performance of stochastic methods involves random sampling and probability calculation. BEAM was created by Zhang and Liu [34] to find epistasis by partitioning SNPs into three nonoverlapping groups based on their posterior probability using Markov Chain Monte Carlo sampling. Schork et al. introduced EpiMODE [35], which combined the epistasis module idea with a Gibbs sampling strategy. Heuristic search is an approximation search guided by heuristic information that can reduce the search space and find the optimal solution effectively but may be limited by local optimal solutions. EpiACO [36] and AntEpiSeeker [37], both based on ant colony optimization, are examples of this sort of approach. Epi-GTBN [38] is an epistasis search approach that incorporates genetic algorithms to the Bayesian network heuristic search strategy.
SNP epistasis is also detected using machine learningbased approaches such as the neural network [39], support vector machine [40], random forest, or association rules [41]. SNPrule [42] is an epistasis detection method based on learning predictive rules, and by identifying the predictive rules involved in epistasis, higher-order epistasis may be inferred. EpiForest [43], random Jungle [44], and SNPInterforest [45] are examples of random forests that have been used in GWAS. They treated the random forest output as the most crucial variable set. This paper introduces EpiMIC (epistasis detection via maximal information coefficient), an epistasis detection method that uses the maximal information coefficient (MIC) to identify marker-level interactions of complex diseases in case-control studies [14,46]. MIC is a good bivariate dependency measure explicitly designed for rapid exploration of almost all types of data relationships, which means it can detect linear, exponential, and cyclical functions. Specifically, it can detect various function types equally, interpret them, and compare them on the same scale. We establish a statistic that utilizes the difference of MIC as an indicator of the occurrence of epistasis and also use the permutation resampling strategy to learn our statistic's empirical distribution. In simulated data sets with a variety of parameters, our method has demonstrated outstanding performance in finding underlying paired epistasis. Its use of WTCCC (Wellcome Trust Case Control Consortium) rheumatoid arthritis (RA) data shows accurate epistasis detection.

Materials and Methods
This section describes the EpiMIC statistical framework. The various parameter choices for simulation studies are presented to assess the power to detect type-I error and pairwise epistasis. Then, using the rheumatoid arthritis data set from the WTCCC database, we evaluated the efficiency of our method in a real-world setting.

EpiMIC
2.1.1. Preliminaries and Notation. Suppose we have n random samples with a collection of p SNPs, then the observed genotypes R p can be represented by a n × p matrix: where g l,i is the random variable that models the genotype for SNP i of l th sample. It is a categorical variable with three levels denoted by g l,i ∈ fAA, Aa, aag = f0, 1, 2g. The homozygote genotypes are represented by AA and aa, whereas the heterozygote condition is represented by Aa. A and a denote the major and the minor alleles of SNP i, respectively. The value indicates the copy number of SNP i's minor alleles. In case-control studies, y i ∈ f0, 1g is a categorical label where 0 represents a control subject and 1 represents a case subject.
To determine whether there was a statistical interaction between two SNPs in a case-control study, we developed a statistic through the maximal information coefficient to quantify the strength of pairwise epistasis. We then used the permutation resampling strategy to estimate the statistic's empirical distribution. Intuitively, any pair of SNPs may have an original dependency or not without the phenotypic background. EpiMIC tried to capture the conditional dependency between a pair of SNPs under a disease status, which is the task-related correlation. It is based on the idea that because control samples are frequently picked at random, the epistasis pattern in case samples is more  [46] is an efficient measure of reliance for bivariate associations that encompasses a wide range of functional and not functional associations. It has two heuristic properties: generality and equitability. Generality means that if the sample size is sufficient, MIC covers not only certain types of functions but also various interesting associations or functional relationships that are not well modeled by a function, such as a superposition of functions. Equitability means MIC should assign identical ratings to similarly noisy associations of all kinds. Let D ⊂ R 2 be a bivariate finite collection of ordered pairs, with x values and y values of D separated into x bins and y bins, respectively. An ðx, yÞ grid is the term given to such a pair of partitions. The distribution Dj G for a grid G can be determined from the data points in D on the cells of G. It is calculated by taking the probability mass in each cell and dividing it by the proportion of data points in D that fall into that cell. Given a constant D, different grids G produce distinct distributions of Dj G . With positive integers ðx , yÞ, define I * ðD, x, yÞ as where the maximum is across all grids ðx, yÞ and IðDj G Þ represents the mutual information of Dj G .
The characteristic matrix MðDÞ of a bivariate data collection D is an infinite matrix with elements: With Equation (3), the MIC of a bivariate data set D with n samples and a grid size smaller than BðnÞ is defined as follows: where ωð1Þ < BðnÞ ≤ Oðn 1−ε Þ for some 0 < ε < 1. The function BðnÞ upper binds the sizes of the grids over which MIC searches. Usually, its default setting is n 0:6 because it works well in practice.
To calculate MðDÞ, it is optimized ideally on all possible grids. But in practice, MIC uses dynamic programming algorithms that optimize only a subset of possible grids for computational efficiency, and it seems to be approaching the true value of MIC.
MIC satisfies the following properties: Calculate the difference ΔMIC 0 between MIC all n 1 +n 2 ðg i , g j Þ and MIC D n 2 ðg i , g j Þ ; 5 for k= 1 to mdo 6 Randomly shuffle label y, and generate the new data set; 7 Repeat Steps 3 and 4; 8 end 9 The estimated p-value of ΔMIC 0 is the number of ΔMIC i , i = 1, ⋯, m, larger than ΔMIC 0 , divided by m. 10 end 11 end 12 Output all the p-value for each pair of SNPs.
Algorithm 1: EpiMIC. 3 Computational and Mathematical Methods in Medicine cases. Let MIC n ðg i , g j Þ be the sample correlation score between the i th SNP and the j th SNP. First, we calculated the MIC all n ðg i , g j Þ for all the samples and MIC D n 2 ðg i , g j Þ for case samples. Second, we devised a statistic ΔMIC = jMIC all n ðg i , g j Þ − MIC D n 2 ðg i , g j Þj/MIC all n ðg i , g j Þ to compare the MIC in cases and in all samples. ΔMIC denotes how dissimilar the relationship ðg i , g j Þ was in cases and across all samples.
The greater the ΔMIC, the more likely it is that g i and g j interacted.
We wanted to learn the empirical distribution of ΔMIC 0 under the null hypothesis to derive a p value. In this case, we employed a nonparametric permutation strategy: first, shuffled the y with m times, computed ΔMIC by the same procedure described above, and used the resultant sample distribution as an estimate for the distribution of ΔMIC. If In Algorithm 1, we summarized the EpiMIC procedure and showed the whole workflow in Figure 1.
2.2.1. Simulation with GAMETES. The performance of the EpiMIC to detect marker-based, pairwise epistasis was examined in this simulation study. We assigned 10 SNPs to each simulation data set. There were two functional SNPs and eight nonfunctional SNPs among them. To produce the simulated genotype data, we used the freely accessible program GAMETES [47]. This program was created to produce pure and strict epistasis models, which are the most challenging to discover if all n-loci are included in the disease model. Because of this requirement, these models are a desirable gold standard for simulation research on complex epistasis [47,48].
(1) Type-I Error Evaluation. Type-I error demonstrates a method's capacity to reject the null hypothesis when it is true. We utilized the GAMETES to create two custom disease models without epistasis ( Table 1). The baseline odd was denoted by γ. We conducted the simulation for each model 100 times with the following sample size n ∈ f1k, 2k , 3k, 4k, 5kg, γ = 1, and θ = 5. The threshold of significance was fixed at 0.05.
(2) Evaluation of Test Power. The power of a test reflects the likelihood that the procedure will properly accept the alternative hypothesis when the null hypothesis is false. This simulation study used two experimental setups: epistasis models without marginal effects (NME) and epistasis models with marginal effects (ME).
For each parameter setting in the NME scenario, we created 100 data sets. The power under each parameter value was stated as the frequency at which the approach successfully rejects the null hypothesis at a significance level of α = 0:05.
(i) We used h ∈ f0:005, 0:01,0:025,0:05,0:1,0:2g and two distinct minor allele frequencies ðMAFÞ ∈ f 0:2,0:4g to analyze the influence of heritability h. Five models were developed for each parameter combination, yielding 60 models following Hardy-Weinberg proportions. For all these models, population prevalence was set to 0.2, and the sample size was set to 4,000. The five models were labeled M1 to M5, and they were sorted in general by rising customized odds ratio (COR) using GAMETES. COR is a detectability statistic derived directly from the genetic model. The higher the value, the simpler it is to identify epistasis (ii) To assess the effect of sample size, we set heritability to 0.025, MAF ∈ f0:2, 0:4g, and prevalence to 0.2, with a sample size of 10,000. Then, using this big data set, 100 data sets were created at random for each of the sample size n ∈ f1k, 2k, 3k, 4k, 5kg. In this case, we had a total of 1,000 data sets In the ME scenario, we generated six models in accordance with Namkung et al. [49]. For each model, 100 replicated data sets with balanced case subjects and control subjects were constructed with a sample size of 4,000 ( Table 2).
For BEAM, MDR, BOOST, and Epi-GTBN, let the number of data sets where they identified the epistasis correctly be m 1 , then the power can be determined using the following formula: We ran BEAM and Epi-GTBN with the default parameter setting. MDR and BOOST had no parameters to be specific. In EpiMIC, n 0:7~n0:8 is effective experimentally. We use n 0:8 as a default parameter.

Experiments Using Data from Rheumatoid Arthritis.
To test EpiMIC's capacity to handle true epistasis in a casecontrol data set, we examined the susceptibility of a series of pairings of SNPs in rheumatoid arthritis (RA), an inflammatory disease characterized by pannus development in synovial joints and cartilage and bone loss. The detailed data set construction can be found in our previous work [48].

Results and Discussion
All results were obtained on a workstation equipped with an Intel Xeon CPU E5-2620 v2 @ 2.10 GHz, 96 GB of DDR3, R 4.0.3, and RStudio programming implementation.

Simulation Study
3.1.1. Type-I Error Evaluation. For type-I error, we set MAF to 0.2 and population prevalence to 0.2, then ranged sample sizes from 1,000 to 5,000. For the no-effect model without epistasis, all the methods tested had a type-I error comparable to the significance level α = 0:05 (Table 3(a)). For the disease model without epistasis, but with one marginal SNP, BEAM and BOOST still controlled type-I error, although MDR, Epi-GTBN, and EpiMIC had little inflation. The result implied that we should choose a lower significance level in practical application to reduce the probability of false positive results.

Evaluation of the Power of EpiMIC
(1) The Influence of Heritability. We investigated two types of epistasis disease models to assess the statistical strength of our EpiMIC and the other four methods: epistasis models without marginal effects (NME) and epistasis models with marginal effects (ME).
In the NME scenario, we examined 12 heritability-MAF combinations, with heritability ranging from 0.005 to 0.2 Table 3: Type-I error for methods BEAM, BOOST, MDR, Epi-GTBN, and EpiMIC. Sample sizes varied from 1,000 to 5,000 under two disease models: (a) no effect model with no epistasis between two SNPs and (b) one marginal recessive model with no epistasis between two SNPs.    Heritability had a significant impact on the power of all methods, and the power increased monotonically with an increase in h under a certain MAF (Table 5). Heritability ranged from 0.005 to 0.01, and all methods demonstrated a consistent rising trend ( Table 5). The power was also affected by the epistasis SNP pair's minor allele frequencies (MAF). Although BEAM fluctuated under model M1 with MAF = 0:4 and h = 0:05, for other methods, the increase in MAF was evident in the improved performance, especially when h = 0:005. Heritability is the effect size of epistasis.
When it was small, the larger MAF increased the chances for causal genotypic combinations of epistasis SNP pairs to emerge in simulation data sets. For example, for the cases  7 Computational and Mathematical Methods in Medicine of h = 0:005, the average power of BEAM was 0.876 with MAF = 0:2, which was lower than 0.912 for MAF = 0:4. Although the performance of the methods under the same model was different, the epistasis detected by the highpower method did not entirely cover epistasis detected by the low-power method. Because these methods were based on different definitions of epistasis, the methods could not simply replace each other; instead, they had a complementary relationship.
It is worth mentioning that, as compared to BEAM or MDR, EpiMIC, BOOST, and Epi-GTBN were more stable for disease models M1 to M5 with varying COR under the same heritability-MAF combination. In the ME scenario, the power to detect epistasis for all methods achieved a 1 that the epistasis disease model with marginal effect was easier to analyze than models without marginal effect.
(2) The Influence of Sample Size. Let the sample size be n ∈ f1k, 2k, 3k, 4k, 5kg, with h = 0:005, and MAF = 0:2 ( Figure 2). As the sample size increased, the power of all methods increased almost monotonically. A larger sample size corresponds to improved performance in all methods.
In conclusion, EpiMIC had superior or comparable performance to detect purely and strictly epistasis in simulation studies, which was the most difficult disease-related patterns with or without marginal effects. EpiMIC benefited from the powerful ability of MIC to capture a wide range of relationships and our designed statistic Δ MIC. If two SNPs interacted under a specific disease model, SNPs showed a more obvious relationship in some cases. Adding control samples decreases the strength of this relationship.
However, the nature of EpiMIC made it likely that it was affected by linkage disequilibrium (LD); because LD is a strong dependency, SNPs in a strong LD block may produce false positives.

Experiments
Using Rheumatoid Arthritis Data. RA is an autoimmune disease in which IL-6, RANK, and TNF − α are key hereditary risk factors [50]. In the RA investigation, each  Figure 3: Variant network of rheumatoid arthritis results from the EpiMIC model with identified SNP pairs. The nodes were SNPs, and the edges represented the epistasis relationship. Node size and color reflected the number of epistasis that the node involved in. Edge thickness indicated the maximal information coefficient of SNPs in case samples. The node labels with highlights were the top 15 SNPs ranked by node degree. 8 Computational and Mathematical Methods in Medicine unique SNP pair of the hsa05323 pathway was analyzed, yielding C 2 385 = 73, 920 total pairings for 385 SNPs. We chose 522 results with a significance level α = 0:005; after filtering SNP pairs in the same gene, we got 517 epistasis to do the following analysis.
We generated the epistasis network ( Figure 3) from 517 epistasis using the network analysis software Gephi, in which the nodes were the SNPs with epistasis and the edge indicated the epistasis relationship. We generated Figure 3 by running the Multigravity Force Atlas algorithm, which prevented nodes from overlapping and controlled the scale of the expansion of the graph while clustering interconnected nodes. The degree of each node represents the number of epistasis that it was involved in, and the edge was weighted by the MIC of pairs of SNPs in case samples. The average degree of the network was 3.009. We filtered the nodes with degree < 5, then ordered the node size and color by its degree. Nodes labeled purple were the top 15 nodes ranked by the node's degree. Table 6 gives a detailed information of the top 15 nodes, which included their degree, the gene where the SNP was located, and the genes where the interacting SNPs were located. We grouped the interacting SNPs into genes. The higher the node's degree was, the greater the chance that it interacted with more genes. But the number of genes that showed epistasis did not increase monotonically with the degree of a node; due to the different numbers of SNPs that were contained in each gene in the data set and their underlying interval LD pattern, one SNP may be detected as epistatic with multiple SNPs in the LD region from a long gene. From the network, we found some valuable hub genes, such as IL15, CD28, Ang1, Tie2, LFA1, and TLR4, which had at least five interacting genes in the RA pathway. For instance, IL-15 [51], which is a member of the 4 α-helix bundle cytokine family, was detected in the serum of RA patients and synovial fluid and in mouse models of arthritis. In addition, the administration of IL-15 led to the development of severe inflammatory arthritis, indicating that IL-15 may be related to RA treatment. Targeting IL-15 is very critical and valuable.
We also analyzed the top 10 epistasis ranked by the MIC in case samples for each pair of SNPs ( Table 7) and found that five of the top 10 results were supported by prior research [52][53][54][55]. For example, the first epistasis was between gene CD80 (rs4675363) and CTLA4 (rs1427676). Costimulatory molecules have a crucial role in the immunoregulatory regulation of T lymphocyte-mediated immunological and inflammatory responses [53]. The best-studied costimulatory signaling pathway was CD28/CTLA4-CD80/ CD86. CTLA-4 is a structural homolog of CD28, and it binds the CD80 and CD86 ligands. CTLA-4, on the other hand, has a 20-to-50-fold greater affinity to CD28, which Table 6: Detailed information of the top 15 nodes ranked by the node's degree of SNP epistasis network generated using EpiMIC with RA data. The column "corresponding gene" indicates the gene where SNP was located, and the column "gene interaction" shows the genes where the interacting SNPs were located.

Conclusions
Epistasis between genetic variants is ubiquitous and crucial in uncovering the underlying genetic structure of complex diseases and traits. In this paper, we developed EpiMIC (epistasis detection via maximal information coefficient), which combined maximal information coefficient (MIC) with permutation strategy for case-control studies in GWAS. We transformed the epistasis detection problem by measuring the degree of difference of MIC between pairwise SNPs in cases and in all samples. The method benefits from the powerful ability of MIC to explore various function types equally and to interpret and compare them on the same scale. Because of the weak assumptions about the nature of epistasis and MIC's powerful and practical capacity to capture complicated functional and nonfunctional correlations, our method accurately and effectively recognized additional sorts of interpretable epistasis.
To assess EpiMIC's performance, we conducted simulated and retrospective investigations. For most of the settings tested, EpiMIC's statistical power to detect epistasis was better or equivalent to prior methods, and its power grew monotonically with heritability, MAF, and sample size. Based on a test of type-I error, the method was shown to be stable to sample size. Simulation results also indicated that epistasis detected by the high-power method did not entirely cover epistasis detected by the low-power method. These methods were based on different definitions of epistasis that methods could not simply replace each other but had a complementary relationship. In our analysis of real data, we found several key genes of RA from an epistasis network and significant epistasis that was supported by prior research. We found that local LD inflated EpiMIC's statistical power slightly. SNP pairs in the same gene with a LD pattern were detected as epistasis occasionally. In practice, we suggest filtering out SNP pairs within a local LD structure or select tagSNPs to detect epistasis. Moreover, due to different sequencing coverage, some causal SNPs may be missing in some data sets, and only markers linked to it are found to be epistasis. Therefore, it would be better to locate the interacting genes through marker-based epistasis first and then to combine more biological information to locate the real causal epistasis. In conclusion, EpiMIC is a useful addition to the current toolkit of statistical methods for elucidating epistasis in GWAS case-control studies.

Data Availability
Publicly available data sets were analyzed in this study. This data can be found here: https://www.wtccc.org.uk/info/ access_to_data_samples.html.