High Order Gene-Gene Interactions in Eight Single Nucleotide Polymorphisms of Renin-Angiotensin System Genes for Hypertension Association Study

Several single nucleotide polymorphisms (SNPs) of renin-angiotensin system (RAS) genes are associated with hypertension (HT) but most of them are focusing on single locus effects. Here, we introduce an unbalanced function based on multifactor dimensionality reduction (MDR) for multiloci genotypes to detect high order gene-gene (SNP-SNP) interaction in unbalanced cases and controls of HT data. Eight SNPs of three RAS genes (angiotensinogen, AGT; angiotensin-converting enzyme, ACE; angiotensin II type 1 receptor, AT 1 R) in HT and non-HT subjects were included that showed no significant genotype differences. In 2- to 6-locus models of the SNP-SNP interaction, the SNPs of AGT and ACE genes were associated with hypertension (bootstrapping odds ratio [Boot-OR] = 1.972~3.785; 95%, confidence interval (CI) 1.26~6.21; P < 0.005). In 7- and 8-locus model, SNP A1166C of AT 1 R gene is joined to improve the maximum Boot-OR values of 4.050 to 4.483; CI = 2.49 to 7.29; P < 1.63E − 08. In conclusion, the epistasis networks are identified by eight SNP-SNP interaction models. AGT, ACE, and AT 1 R genes have overall effects with susceptibility to hypertension, where the SNPs of ACE have a mainly hypertension-associated effect and show an interacting effect to SNPs of AGT and AT 1 R genes.


Introduction
The renin-angiotensin system (RAS) represents a critical endocrine regulator for maintaining blood pressure and blood fluid volume in the circulatory system. Single nucleotide polymorphisms (SNPs) of RAS genes such as angiotensinogen (AGT) [1,2], angiotensin-converting enzyme (ACE) [3,4], and angiotensin II type 1 receptor (AT 1 R) [5,6] are known to be associated with cardiovascular diseases [7][8][9]. For example, the SNP G-217A of AGT gene but not the SNPs A-6G and M235T of AGT gene may associate with hypertension in patients from Taiwan [1]. The I allele of ACE gene and +1166 C allele of AT 1 R gene are reportedly associated with hypertension [10]. However, these studies were mainly relying on the association with hypertension using single SNP models and rare SNP effects were commonly ignored.

BioMed Research International
Accumulating evidence indicates that high order genegene (SNP-SNP) interaction can deeply affect disease susceptibility. For example, the A1166C of AT 1 R gene and I/D of ACE gene have synergistic effects on acute myocardial infarction [11]. The interactions between T174M, M235T, G-6A, A-20C, G-152A, G-217A of AGT gene, I/D of ACE gene, and A1166C of AT 1 R gene have been examined in coronary artery disease [12]. A significant effect of gene-gene interaction in coronary artery disease was detected for G-217A and M235T of AGT gene and I/D of ACE gene. Additionally, joint effects of genegene interactions were discovered in blood pressure regulation [13], left ventricular mass [14], and acute myocardial infarction [11]. However, detecting gene-gene interactions remains a challenge due to a large number of possible SNP combinations.
To date, several computational methodologies have been proposed to detect the epistasis in many association studies [15][16][17][18][19][20][21][22]. Data mining and statistical analysis are a common approach to overcome computational challenges in detecting complex gene-gene interactions. For example, multifactor dimensionality reduction (MDR), a nonparametric statistical method, is commonly used for detecting possible gene-gene interactions in multigene causing diseases [23,24]. However, this common MDR is only suitable for a balanced number of cases and controls. The original data sets of many association studies are usually unbalanced. Therefore, some information in real data set might get lost after resampling.
Here, we describe a case-control study of hypertension susceptibility that specifically evaluates gene-gene interactions using unbalanced function based MDR [25] that combines traditional statistical methods with novel computational algorithms. The unbalanced function based MDR uses the ratio between the percentages of cases in each genotype combination of case data and the percentage of controls in each genotype combination of control data. This is to classify by MDR classifier, to analyze possible gene-gene interactions associated with hypertension. Subsequently, the misclassification errors of multiple SNPs associated with high or low risks of hypertension can be computed. To examine the high order SNP-SNP interactions of RAS genes in hypertension, 8 SNPs were chosen, namely, the T174M/M235T/G-6A/A-20C/G-152A/G-217A of AGT gene, I/D of ACE gene, and A1166C of AT 1 R gene. We aimed to find out the influence of these 8 SNPs on hypertension outcomes. The results show that unbalanced function based MDR can avoid the drawback of common MDR in an unbalanced real data set. Thus, the best unbalanced function based MDR model can correctly predict high order SNP-SNP interactions of hypertension susceptibility using real data sets.

Methods
The MDR method was briefly introduced and the unbalanced function based MDR method was explained in detail as follows.

MDR.
In 2001, Ritchie et al. proposed a MDR to detect the potential gene-gene interaction. MDR is a robust nonparametric method that detects nonlinear interactions among multiple discrete genetic factors [23]. It is accomplished by data classifier technology to combine two or more attributes into a single attribute. Thus, representation of data space can be changed, and high-order gene-gene interactions can be evaluated by statistical classifiers. Figure 1 illustrates the MDR procedure that produces the best model by the following algorithm.
Step 1. Divide the data set into 10 subsets for cross-validation (CV).
Step 2. Keep the th data set as the testing data and others are the training data.
Step 3. Calculate the total number of cases and the total number of controls within each multifactor class.
Step 4. Evaluate the ratio between cases and controls in each genotype combination (i.e., a cell in n × n grid).
Step 5. Determine the ratio of high (H)/low (L) risks in each multifactor class. If the cases/controls ratio particular threshold, it is labeled with "H"; otherwise it is labeled with "L".
Step 6. Compute the four frequencies of true positive (TP), false positive (FP), true negative (TN), and false negative (FN) in a 2-way contingency table.
Step 7. Evaluate the misclassification error.
Step 8. Repeat for each combination.
Step 9. Select the best model according to minimum misclassification error and record it into cross-validation consistency (CVC).
Step 10. Repeat for each CV interval.
Step 11. Select the best model according to the model with the highest frequency in CVC.
In MDR procedure, the original data are randomly sorted and divided into 10 subsets for CV. In each CV interval, 9 of 10 subsets are classified as the training data and the remaining one as independent testing data. The loci and a possible multiloci class are represented in the following -dimensional space: The value of is designated depending on the number of factors being considered. Then, a set of genetic factors is selected. The total numbers of cases or controls are counted in the multifactor class, and the ratio of the numbers of cases to controls is calculated. From (2), the multifactor class count and ratio can be obtained as follows:   where where acronyms represent the following: : the case data set; : the control data set; * : the number of case group in the training set; * : the number of control group in the training set; : the vector of variable combinations.
The function ( ) indicates a match if all parameters in vector match their cases or controls, then given a score of "1, " otherwise, given the score "0. " Next, the high/low risk in each multifactor class is determined. Each multifactor class in n-dimensional space is labelled as "H" or "L" symbol. Label "H" indicates that ratio in the multifactor class meets or exceeds a particular threshold (high-risk group); otherwise, label is "L" (low-risk group). The threshold is equal to the one in a balanced data set. Thus, the huge genotype combinations in -loci are reduced into a 2way contingency table (TP, FP, TN, and FN) and allow the statistical analysis to evaluate the -loci effect. MDR uses the misclassification error to evaluate a model value where the misclassification error function is calculated as follows: where acronyms represent the following: TP: the total number of labeled "H" in the case data; FP: the total number of labeled "H" in the control data; FN: the total number of labeled "L" in the case data; TN: the total number of labeled "L" in the control data. After all the multifactor combinations are evaluated by misclassification error, the MDR model with the minimum error rate individual is chosen and the model is considered the best model of training data at -fold. The training model is then used to test the testing data and record the TP, FP, FN, and TN for evaluating the statistical power. Thus, MDR repeats the above procedure in each CV interval. The MDR model with the lowest number of misclassified individuals is chosen and ten best models are classified by the same combination in the CVC. Finally, the highest occurring frequency in CVC is considered the best model. If a tie between 2 or more models happens, then the first appearing model is considered the best model.

Unbalanced Function Based MDR.
In unbalanced function based MDR [25], the ratio between the percentages of cases in each genotype combination of case data (i.e., a cell in × grid for cases) to the percentages of controls in each genotype combination of control data (i.e., a cell in × grid for controls) is proposed to classify the data to the high-and low-risk groups. Thus, the highest ratio between case and control groups can be clearly detected. The strategy is to modify the ratio between cases and controls in the ratio function of MDR, that is, (2). The following equation is introduced to calculate the ratio (percentage) of cases to controls: where where acronyms represent the following: : the cases data set; : the control data set; * : the number of case group in the training set; * : the number of control group in the training set; : a vector of variable combinations.
The function ( ) is a match (given a score of "1") if all parameters in vector match their cases or controls; a mismatch is given the score "0. " Our strategy is to modify the misclassification error rate function of MDR, that is, (4). Equation (7) proposed by Velez et al. [26] is introduced into MDR; therefore, the two classes are equally responsible for both positive and negative errors due to the class imbalance. The equation evaluates the misclassification error rate according to the arithmetic mean of sensitivity and specificity. The adjusted misclassification error is algebraically identical to the error rate if the data set is imbalanced. Consider where acronyms represent the following: TP: the total number of labeled "H" in the case data; FP: the total number of labeled "H" in the control data; FN: the total number of labeled "L" in the case data; TN: the total number of labeled "L" in the control data.

Study Population.
This was a single center, case-control study. A detailed description of the subject collection has been published previously [1,3]. In brief, hypertensive and normotensive patients (HT and non-HT subjects) were recruited from an outpatient clinic of the National Taiwan University Hospital from July 1995 through June 2002. The non-HT subjects were from the same areas as the hypertensives and had no history of hypertension, diabetes mellitus, renal insufficiency, significant hepatic disease, or apparent coronary artery disease. The basic characteristics of the HT and non-HT groups have been described previously [27] and are shown in Table 1. The demographic and laboratory data were collected from the medical chart records. The study protocols were reviewed and approved by a local institutional committee. All subjects gave informed consent as approved by the institutional review board at this hospital.

Statistical Analysis.
The power statistical analysis was implemented by the G * power 3.1.5 tool [28,29]. The SNPs were evaluated by their odds ratios (OR), 95% CI, and values. OR was used to measure the risk of disease; values indicate significant differences between the cases and controls. All statistical analyses were implemented using SPSS version 19.0 (SPSS Inc., Chicago, IL).

Data Set.
The hypertension data set with hypertension ( = 313) and nonhypertension ( = 130) was obtained from our previous study [27]. The complete genotype data set is available at http://bioinfo.kmu.edu.tw/non-HT and HT genotype data.xlsx. Eight SNPs were included: T174M (rs4762), M235T (rs699), G-6A (rs5051), G-217A (rs5049), G-152A (rs11568020), A-20C (rs5050), I/D (rs4646994), and A1166C (rs5186) of three RAS genes (AGT, ACE, and AT 1 R). However, the possible SNP-SNP interaction was not examined. Here, we used the unbalanced function based MDR with minimum misclassification error rate to identify the best SNP-SNP interaction model with significant differences between hypertension (HT, cases) and nonhypertension (non-HT, controls) groups. Table 1 shows the basic characteristics of the HT and non-HT groups. HT patients had a significantly higher risk for male gender, age, systolic blood pressure (BP(S)), diastolic blood pressure (BP(D)), and cholesterol. Body height, body weight, body mass index, cigarette smoking, and triglyceride were similar between HT and non-HT groups. The age, systolic blood pressure, diastolic blood pressure, and cholesterol of the hypertensives were significantly higher than those of the normotensives in Table 1. Table 2 shows the performance ( values of chi-square test) of each individual SNP. Among these eight SNPs, most individual SNPs paired with any genotype show no significant difference ( > 0.05) between the HT and non-HT groups. The frequency difference of SNP I/D of ACE gene between HT and non-HT groups is significant when based on chi-square test (ID and DD, = 0.031 and 0.010, resp.). However, it is not significant after a Bonferroni correction ( > 0.006, i.e., 0.05/8).

Multilocus Analyses:
Determination of the Best Model. All significant 2-locus SNP-SNP interactions (Table 3) are known to define the epistasis risk score which are collectively referred to as an epistasis network. Although some 2-locus models have higher OR values and lower values, the best model was selected according to the model with the highest frequency in CVC which consisted of the model with the lowest error rate in each CV. Among these models, the lowest error of the best model is 0.419 for a 2-locus model (AGT G-217A + ACE I/D). Similarly, all the best models in 3-to 8-locus models are listed in Table 4. Table 4 summarizes the results of the unbalanced function based MDR analysis for the best 2-to 8-locus models. Consistency data indicate that including more SNPs leads to a higher occurrence of hypertension. As the loci number increases, the prediction error rates were reduced from 41.9 to 32.6. In other words, the correct prediction was 58.1∼67.4%. An 8-locus model had a minimum prediction error of 32.6%. Based on the null hypothesis of no association, it is impossible that an error rate ≤32.6 is observed by chance in randomized data. The 2-to 6-locus models suggest that those SNPs of the AGT and ACE genes were associated with hypertension. Both 7-locus and 8-locus models suggest that the listed SNPs in AGT, ACE, and AT 1 R genes were important in association with hypertension. Additionally, power analysis represents the degree of rejection for H0 that is significant at = 0.05. Applying the MDR method, the testing data results were always not significant (at > 0.05). Thus, we defined H0 as the result of the test set is the same as for the training set (H0), and H1 shows that the result of the test set was different from the training set. The powers in 2-to 8-locus, ranging from 0.901 to 0.999, indicate that occurrence probabilities in all models are higher than 0.9. These findings suggest that all these 8 SNPs are significantly associated with hypertension. Table 4, the occurrences of frequency differences between HT and non-HT groups are different, the best 2-to 8-locus models generated from unbalanced function based MDR are significant ( < 0.01, data not shown). The OR values in 2-to 8-locus models increase from 2.054 to 4.628. For the implementation of bootstrapping in 1000 samples, the adjusted OR (Boot-OR) values increase from 1.972 to 4.483 in 2-to 8-locus models and the values of 2-to 8-locus models decrease from 0.003 to 1.48E-09. Both OR and Boot-OR values gradually increase when the loci numbers increase indicating that the hypertension risk is increasingly raised by the joint effect of SNPs. It also suggests the SNPs of AGT, ACE, and AT 1 R genes are highly associated with hypertension risk. SNPs G-217A (AGT) and I/D (ACE) occur in all best models of 2 to 8 loci in association with hypertension. The associated effects of A1166C (AT 1 R) are detected at the best 7-and 8-locus models and this leads to the highest risk compared to other models.

Discussion
Many important genes associated with hypertension were reported [30][31][32], but most of them are based on a single-SNP model and the potential joint effects of multiloci models were less addressed. The single-SNP model of our current study, I/D (ACE), is significantly associated with hypertension using the chi-square test. However, it is nonsignificant after Bonferroni's correction (Table 2). However, it was reported that nonsignificant SNPs when combined generate joint effects that are associated with diseases [33]. Thus, the effects of some SNPs may be ignored in a single-SNP model. Accordingly, gene-gene interaction analysis was chosen in this study to identify the possible joint effect of these nonsignificant SNPs in association with hypertension.
MDR is a robust analysis for a gene-gene interaction based on detecting nonlinear multigene interactions. MDR also limits the balanced study population to, respectively, determine the high and low risk for cases and controls. Therefore, the MDR is unsuitable for the majority of natural data sets which commonly belong to imbalanced cases and controls. The threshold = 1 can effectively distinguish between high and low risks in each genotype combination (i.e., a cell in × grid of high-and low-risks) of MDR (steps 4 and 5 in Figure 1) but faults in the imbalanced data set. Although resampling techniques are widely applied to fit for MDR detecting epistasis in imbalanced data sets, the possible information missing by resampling is hard to be excluded. In contrast, we demonstrated that our proposed unbalanced function based MDR is suitable for an imbalanced data set. For example, Figure 2 illustrates details of the computational process such as TP, TN, misclassification error rate, the total number of high-risk groups, and the total number of low-risk groups, which were obtained from a 2-locus SNP-SNP interaction model in MDR and unbalanced function based MDR. Figures 2(a) and 2(c) show the processes of the model selections and Figures 2(b) and 2(d) are the corresponding details for the models of Figures 2(a) and 2(c) in terms of the numbers of high-and low-risk groups. In Figure 2(a) for MDR, values of error rates show around 0.3 in 100 models of MDR, and TPs are always higher than TNs. Although TPs are slowly increased in all models, all error rates do not remain improved clearly. At the best model of MDR, the sensitivity and specificity are 0.0089 and 1, respectively. In Figure 2(c), for unbalanced function based MDR, the TNs are not always higher than the TPs. The error rates are clearly improved when there are small difference values between TP and TN. The sensitivity and specificity of the best model are 0.667 and 0.495, respectively. Figure 2(b) for MDR clearly shows that high-risk groups in 100 models are BioMed Research International 7     always higher than low-risk groups due to the imbalanced data between cases ( = 313) and controls ( = 130) whereas Figure 2(d) for the unbalanced function based MDR shows the better frequencies of the numbers of high-and low-risk groups. Thus, (5) and (7) are effective in overcoming the MDR detecting multiloci interactions in imbalanced data sets. In summary, MDR may fail to correctly assign genotypes of multiloci to either high-or low-risk groups and does not provide correct error rates when the datasets are unbalanced. Thus, low error rate and high OR value of MDR may be due to its high TN. However, its TP value is low and indicates a low sensitivity for disease detection. For hypertension association, AGT gene haplotype (T174M, M235T, G-6A, A-20C, G-152A, and G-217A) had been reported to interact with I/D of the ACE gene [3]. However, the role of A1166C of AT 1 R gene in interacting with this AGT gene haplotype was not investigated. Because the six SNPs in the AGT gene are bound together due to its haplotype environment their potential interaction to SNPs of other genes may be limited. However, the joint effect of multiple SNPs between different genes may have a higher association degree than that of a single significant SNP. In the example of a natural data set with an unbalanced HT group and non-HT group, the unbalanced function based MDR algorithm is able to detect the significant association with hypertension in terms of 2-to 8-locus models by their OR and Boot-OR values ( Table 4). The hypertension associated performance of all SNPs is additive from 2-to 8-locus models. The SNP appearing order is the same as the order of SNP joint effects in terms of the additive OR value (i.e., OR +1 − OR ; Table 4 The SNP-SNP interaction networks (Table 4) have further been validated by the single-SNP-to-single-SNP interaction analyses (Figure 3) for the best 2-to 8-locus models in terms of OR values. SNPs involved in one or more significant interactions are represented as nodes, and the pairs of SNPs with significant interactions are connected by lines. For example, all SNPs are significantly associated with ACE I/D, suggesting that ACE I/D is mainly associated with hypertension. The G-217A and ACE I/D are integrated to the best 2-locus model. The left scale of vertical axis is the log 10 value for the total numbers of TP and TN. The right scale of vertical axis is the error rate. Blue line is the error rate based on (7), denoted by "Adjust Err". The horizontal axis represents the 100 different models in 2-locus combinations, in which the models are sorted by error rate and selected by systematic sampling from all models. (b and d) The high/low lines indicate the distribution difference between the numbers of high-and low-risk groups.
The G-6A is further integrated to the best 3-locus model and it is validated that G-6A has positively interacted with G-217A and ACE I/D (OR > 1). Similarly, the other newly integrated SNPs in each multiloci model have positively interacted with the previous SNPs in each multiloci model.
Misclassification error is widely used as misclassification performance that aims to correctly estimate the proportions for an incorrect prediction. In MDR, the incorrect prediction error is an internal validation of a measurement that protects against finding chance associations in the sample. The misclassification errors of these multiloci models are much lower than 50%, indicating that the chance associations are significantly reduced. Table 4 also shows that the error rates are gradually reduced from low-to high-order interaction, suggesting that our proposed models are much effective for misclassification of risk of diseases.
In conclusion, hypertension is resulting from the interaction of several genetic risk factors. Analyses of multiple SNP-SNP interactions are complex and remain computational challenges when huge numbers of genetic factors are simultaneously considered. Moreover, the unbalanced data set may have to be analyzed with a bias due to the limitation nature of the MDR approach. In contrast, our proposed algorithm can constitute a nonparametric statistical analysis and provide a model-free and high-order-way measurement for epistasis without the limitation of a balanced data set. Accordingly, a significant outcome can be discovered from the high-order SNP-SNP interaction model amongst an unbalanced data set of many diseases including hypertension. Our results suggest that AGT, ACE, and AT 1 R genes have an overall hypertension susceptibility effect. Among them, SNP I/D of ACE has the main association effect to hypertension and it also displays norder interaction effect to SNPs of AGT and AT 1 R genes although they do not have a mutual effect on each other. The unbalanced function based MDR model can explore the epistasis network of SNPs AGT, ACE, and AT 1 R of RAS genes and identify strongly significant hypertension association. These interaction models and epistasis networks amongst  AGT, ACE, and AT 1 R genes may be regarded as potential biomarkers for hypertension susceptibility. Our results also demonstrate that this powerful method has a potential to identify several disease-associated multiloci models as susceptibility biomarkers.