Single nucleotide polymorphisms (SNPs) account for most of the genetic variation in the human genome. SNPs are associated with many complex and common diseases such as Alzheimer’s disease (AD). Discovering SNP biomarkers at different loci can improve early diagnosis and treatment of these diseases. Bayesian networks (BNs) provide a comprehensible and modular framework for representing interactions between genes or single SNPs. Here, different Bayesian network structure learning algorithms were applied to whole genome sequencing (WGS) data to detect causal AD SNPs and gene-SNP interactions. We focused on polymorphisms in the top ten genes associated with AD, as identified by genome-wide association (GWA) studies. New SNP biomarkers were observed to be significantly associated with Alzheimer’s disease. These SNPs are rs7530069, rs113464261, rs114506298, rs73504429, rs7929589, rs76306710, and rs668134. The results demonstrate the effectiveness of using BNs for identifying AD causal SNPs with acceptable accuracy. They indicate that the SNP set detected by Markov blanket based methods has a strong association with AD and achieves better performance than both naïve Bayes and tree augmented naïve Bayes. Minimal augmented Markov blanket reaches an accuracy of 66.13% and a sensitivity of 88.87%, versus 61.58% and 59.43%, respectively, for naïve Bayes.
An important topic in human genome research is the investigation of genetic variants related to complex diseases. Most of these genome-wide association (GWA) studies [
A SNP is a single nucleotide site where exactly two (of four) different nucleotides occur in a large percentage of the population. SNPs can contribute to complex disorders in two different ways, either by changing the structure of a specific protein or by changing the abundance of the protein [
A genetic association study aims to find statistical associations between genotypes (genetic variants) and phenotypes (traits or disease states) and thus to identify genetic risk factors [
Alzheimer’s disease (AD) is a brain disease characterized by slowly progressing memory failure, confusion, poor judgment, and, ultimately, death [
Bayesian learning is a successful method for learning the structure of data in different applications. We chose Bayesian methods for several reasons. They offer several structure learning algorithms. They provide models of causal influence that allow us to explore causal relationships, perform explanatory analysis, and make predictions. Finally, Bayesian networks provide a way to visualize results. As an alternative, machine learning methods such as Random Forest (RF) have identified potential causal variants on risk for complex diseases like AD [
Recent studies have attempted to correlate high-throughput single nucleotide polymorphism (SNP) data with large-scale imaging data [
Many genes have been linked to the disorder. However, only a minority of them are supported by a sufficient level of evidence. Among all SNPs, only those belonging to the top 10 AD candidate genes listed on the AlzGene database [
The paper is organized as follows. Section
Our goal was to apply Bayesian network structure learning (BNSL) to detect potential causal SNPs for Alzheimer’s disease. Furthermore, we aimed to identify SNPs that interact with the causal SNPs, in addition to the causal SNPs themselves [
The main stages of the proposed system are described in the workflow shown in Figure
Summary of the proposed system.
Whole genome sequencing (WGS) data of 812 individuals were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (As such, the investigators within the ADNI contributed to the design and implementation of ADNI and/or provided data but did not participate in analysis or writing of this report. A complete listing of ADNI investigators can be found at
The subset of the ADNI data used includes 282 controls, 442 MCI, and 48 AD as the baseline diagnosis. We selected SNPs belonging to the top ten AD candidate genes listed on the AlzGene database using the PLINK program. In total, the SNP-genotype fields comprise 496 single SNPs. Table
The top candidate genes and the number of SNPs in each gene.
Gene | Chromosome | Number of SNPs | Potential pathways |
---|---|---|---|
APOE | 19 | 6 | Cholesterol/lipid metabolism |
BIN1 | 2 | 101 | Endocytic pathways |
CLU | 8 | 32 | Immune and cholesterol/lipid metabolism |
ABCA7 | 19 | 36 | Cholesterol/lipid metabolism; immune and complement systems/inflammatory response |
CR1 | 1 | 71 | Immune and complement systems/inflammatory response |
PICALM | 11 | 138 | Endocytic pathways |
MS4A6A | 11 | 12 | Immune and complement systems/inflammatory response |
CD33 | 19 | 13 | Immune and complement systems/inflammatory response |
CD2AP | 6 | 61 | Endocytic pathways; immune and complement systems/inflammatory response |
An initial quality control based filtering with PLINK [
Subsequently SNPs whose minor allele frequency is less than 0.01 and whose Hardy-Weinberg
The whole genome sequencing (WGS) data used in this study were gathered from 812 ADNI participants spanning the normal, MCI, and AD groups. The phenotype data for each participant were matched with the corresponding genotype information. We coded the phenotype as 1 for the normal group and 2 for the AD group, according to the baseline exam.
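The allele-frequency and Hardy-Weinberg quantities used in the quality-control filtering can be illustrated with a minimal Python sketch. The filtering in this study was carried out with PLINK; the code below is not that implementation. It assumes genotypes are coded as the number of copies of one allele (0, 1, or 2) with `None` for a missing call, and it uses a simple chi-square statistic as a stand-in for PLINK's exact Hardy-Weinberg test. The function names and the coding scheme are hypothetical; only the 0.01 frequency threshold comes from the text.

```python
from collections import Counter

def minor_allele_frequency(genotypes):
    """Minor allele frequency from 0/1/2 allele counts (None = missing)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    p = sum(called) / (2 * len(called))  # frequency of the counted allele
    return min(p, 1.0 - p)

def hwe_chi_square(genotypes):
    """Chi-square statistic comparing observed genotype counts with
    Hardy-Weinberg expectations (illustrative stand-in for an exact test)."""
    called = [g for g in genotypes if g is not None]
    n = len(called)
    counts = Counter(called)
    p = (2 * counts[2] + counts[1]) / (2 * n)
    if p in (0.0, 1.0):
        return 0.0  # monomorphic site: nothing to test
    expected = {0: n * (1 - p) ** 2, 1: 2 * n * p * (1 - p), 2: n * p ** 2}
    return sum((counts[g] - e) ** 2 / e for g, e in expected.items())

def passes_maf_filter(genotypes, maf_min=0.01):
    """Keep a SNP only if its minor allele frequency is at least maf_min."""
    return minor_allele_frequency(genotypes) >= maf_min
```

A SNP whose genotype counts sit exactly at Hardy-Weinberg proportions yields a statistic of zero, and larger values indicate stronger departure from equilibrium.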
This section explores the Bayesian network approach and its applicability to understanding the genetic basis of disease. Bayesian networks are a type of probabilistic graphical model (PGM) that can represent the conditional dependencies and independencies between a set of random variables via a Directed Acyclic Graph (DAG) [
A BN is defined by two models, structural
Genome-wide association studies (GWASs) aim to identify gene-SNPs that are involved in human disease or that may contribute as risk factors for developing a complex disease. In order to understand how gene networks contribute to a certain disease, Bayesian networks have been used to represent the relationship between genetic variants and a phenotype (disease status).
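For reference, the DAG of a Bayesian network encodes the standard factorization of the joint distribution into local conditional distributions. The notation is ours: $X_1, \dots, X_n$ denote the network variables (here the SNPs together with the disease status) and $\mathrm{Pa}(X_i)$ denotes the parents of $X_i$ in the graph.

```latex
P(X_1, \dots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
```

Each structure learning algorithm discussed here amounts to a different way of choosing the parent sets $\mathrm{Pa}(X_i)$.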
The following subsections present different classification algorithms supported by BayesiaLab [
The least complex structure is the naïve Bayes structure (NB structure), which assumes that the predictor variables are conditionally independent given the class. This means that interactions between attributes within individuals of the same class are ignored. In the naïve Bayes structure, all variables are children of the target variable. The Bayesian classifier is built from training data, which requires estimating the probabilities of each variable node given the class variable and the prior probabilities of the class [
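The naïve Bayes computation can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the BayesiaLab implementation used in the study: genotypes are assumed to be coded 0, 1, or 2, class labels follow the 1 (normal) and 2 (AD) coding, and Laplace smoothing with a hypothetical parameter `alpha` is added so that unseen genotypes do not produce zero probabilities.

```python
import math
from collections import Counter

def fit_naive_bayes(X, y, n_states=3, alpha=1.0):
    """Estimate log priors and log conditional tables P(genotype | class)."""
    n = len(y)
    class_counts = Counter(y)
    log_prior = {c: math.log(k / n) for c, k in class_counts.items()}
    log_cond = {}
    for c in class_counts:
        rows = [x for x, label in zip(X, y) if label == c]
        denom = len(rows) + alpha * n_states
        tables = []
        for j in range(len(X[0])):
            counts = Counter(row[j] for row in rows)
            tables.append([math.log((counts[g] + alpha) / denom)
                           for g in range(n_states)])
        log_cond[c] = tables
    return log_prior, log_cond

def predict_naive_bayes(model, x):
    """Return the class with the highest posterior score for one individual."""
    log_prior, log_cond = model
    scores = {
        c: log_prior[c] + sum(log_cond[c][j][g] for j, g in enumerate(x))
        for c in log_prior
    }
    return max(scores, key=scores.get)
```

Because each SNP contributes an independent term to the score, the model ignores SNP-SNP interactions within a class, which is exactly the assumption the augmented structures relax.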
The augmented naïve Bayesian algorithm begins with an NB structure but relaxes the conditional independence assumption between the child variables. After the standard NB structure is created, a greedy search algorithm is used to find connections between the child nodes. In the tree augmented naïve Bayes (TANB) structure, the class variable has no parents and each variable node has at most two parents, one of which is the class variable [
The Markov blanket (MB) algorithm searches for the nodes belonging to the Markov blanket of the target node, that is, its parents, its children, and its spouses (the other parents of its children). Knowing the values of every node in this subset makes the target node independent of all the other nodes. Because the search is entirely focused on the target node, it obtains the subset of nodes that are really useful much more quickly than other algorithms such as naïve Bayes. Furthermore, this method is a powerful variable selection algorithm and is well suited to the analysis of a single target variable [
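To make the definition concrete, the sketch below reads the Markov blanket of a target node off a DAG that is already known. This is a simpler task than the one described above, where the blanket is learned from data; the example only illustrates which nodes belong to the blanket. The DAG representation (a dictionary mapping each node to the set of its parents) and the node names are hypothetical.

```python
def markov_blanket(parents, target):
    """Return the parents, children, and co-parents of `target`.

    parents: dict mapping each node to the set of its parent nodes.
    """
    children = {node for node, ps in parents.items() if target in ps}
    co_parents = set()
    for child in children:
        co_parents |= parents[child]  # other parents of the target's children
    blanket = set(parents.get(target, set())) | children | co_parents
    blanket.discard(target)  # the target is not part of its own blanket
    return blanket

# Hypothetical toy network: snp_a -> AD -> snp_b <- snp_c -> snp_d
toy_dag = {
    "snp_a": set(),
    "snp_c": set(),
    "AD": {"snp_a"},
    "snp_b": {"AD", "snp_c"},
    "snp_d": {"snp_c"},
}
ad_blanket = markov_blanket(toy_dag, "AD")
```

In the toy network, `snp_d` is excluded from the blanket of `AD` because it is connected to the target only through `snp_c`, which is already in the blanket.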
Minimal augmented Markov blanket (MAMB) starts with the Markov blanket structure and then uses an unsupervised search to find the probabilistic relations between the variables belonging to the Markov blanket. MAMB reduces the set of nodes, which results in a more accurate target analysis [
Bayesian network structure learning was used to establish causal relationships or dependencies between SNPs in the network and to identify the most efficient path towards AD diagnosis. We introduced a framework for comparing different Bayesian network algorithms in order to identify the one with the best performance. We randomly selected 20% of the dataset as the Test Set, and the remaining 80% served as the Learning Set. The Expectation Maximization (EM) algorithm was used to handle missing values in BN learning. It is an iterative method: the Expectation step uses the observed variables and the current parameter estimates to infer the expected values of the missing data, and the Maximization step re-estimates the parameters that are most likely given the completed data. The two steps repeat until the estimates converge [
We managed network complexity via the Structural Coefficient (SC) parameter. Experiments over a range of SC values were carried out to find relationships (links) between the variables. These experiments indicated that an SC value of 0.25 for MB and MAMB ran much faster and found significant relationships between the variables.
We applied four different supervised algorithms (naïve Bayes, tree augmented naïve Bayes, Markov blanket, and minimal augmented Markov blanket) to predict the state of the diagnostic variable, that is, normal or AD. The four resulting Bayesian networks for the classification are shown in Figures
(a) Naïve Bayes structure. (b) Tree augmented naïve Bayes structure.
The network structure of (a) Markov blanket algorithm and (b) minimal augmented Markov blanket.
Top SNPs related to Alzheimer’s disease using minimal augmented Markov blanket (SNPs kgp11800793 and kgp5536625 overlap because they have the same mutual information with AD).
SNP APOE112, located in the APOE gene on chromosome 19, shows a significant association score with AD. APOE112 was the top-ranked SNP correlated with AD in all four Bayesian models. This result is consistent with APOE being the strongest known genetic risk factor for AD. SNP rs769449, located in APOE on chromosome 19, was the second-ranked SNP correlated with AD in both NB and TANB, while kgp15578484 (rs7530069), located in the CR1 gene on chromosome 1, was the second-ranked SNP correlated with AD in both MB and MAMB.
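Ranking SNPs by their mutual information with the diagnosis variable can be illustrated with the following Python sketch. It is not the scoring used by BayesiaLab; it only shows the plain empirical mutual information between two discrete variables, measured in bits. The function names and the dictionary layout (SNP identifier mapped to a list of genotypes) are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
        for (x, y), c in count_xy.items()
    )

def rank_snps(genotype_columns, labels):
    """Sort SNP identifiers by decreasing mutual information with the labels."""
    scores = {
        snp: mutual_information(column, labels)
        for snp, column in genotype_columns.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Two SNPs with identical mutual information receive the same score, which is why such SNPs can coincide when the ranking is plotted.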
Some of the SNPs shown in our study to be associated with AD risk have been previously identified in other studies, such as APOE112, rs4844609, rs769449, rs4732729, rs9331942, rs610932, and rs611267.
Other, new SNPs were observed to be significantly associated with Alzheimer’s disease. These SNPs are rs7530069, rs113464261, rs114506298, rs73504429, rs7929589, rs76306710, and rs668134. Some other SNPs previously observed to be associated with AD were tested in our study and were not significant. Our results did not include these SNPs because of an insufficient sample size. Further studies may be needed in larger populations with larger numbers of SNPs.
Overall performance can be expressed as the total precision (that is, the accuracy), computed as the total number of correct predictions (true positives + true negatives) divided by the total number of cases in the Test Set. Standard accuracy comparisons were carried out for the four algorithms on all the datasets. Prediction accuracy, sensitivity, and specificity are reported in Table
Prediction accuracy, sensitivity, and specificity for the algorithms used.
Algorithm | Accuracy | Sensitivity | Specificity | Number of SNPs |
---|---|---|---|---|
Naïve Bayes | 61.58% | 59.43% | 65.6% | 435 |
Tree augmented naïve Bayes | 64.29% | 67.55% | 58.16% | 435 |
Markov blanket | 65.64% | 77.55% | 43.26% | 13 |
Minimal augmented Markov blanket | 66.13% | 88.87% | 16.31% | 11 |
The table also reports the number of predictor SNPs that resulted from each algorithm. For the naïve Bayes and tree augmented naïve Bayes networks, a total of 435 SNPs out of 496 were considered as predictors. However, the number of predictor SNPs was reduced to 13 and 11 for Markov blanket and minimal augmented Markov blanket, respectively, with higher accuracy.
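The three reported metrics follow directly from the confusion matrix on the Test Set. The Python sketch below shows the computation, assuming that AD (coded 2) is treated as the positive class and normal (coded 1) as the negative class; that choice and the function name are our assumptions and are not stated in the text.

```python
def classification_metrics(y_true, y_pred, positive=2):
    """Accuracy, sensitivity, and specificity from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    return {
        # correct predictions over all cases
        "accuracy": (tp + tn) / len(pairs),
        # fraction of positive cases that were detected
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        # fraction of negative cases that were correctly rejected
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }
```

Reporting all three values together matters here because a classifier can reach a high sensitivity simply by predicting the positive class for most cases, at the cost of a low specificity.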
We evaluated the performance of these four BN structures using 10-fold cross validation. The dataset was randomly partitioned into ten approximately equal sets such that each set had a similar proportion of individuals who developed AD. We applied the algorithms to nine sets taken together as the training data and evaluated the classifier performance on the remaining set as test data. We repeated this process for each possible test set to obtain an AD prediction for each individual in the dataset. We used the predictions to compute the Receiver Operating Characteristic (ROC) curve, a widely used measure of classification performance. ROC graphs allow a broader comparison of classifiers than a single-value metric such as accuracy and may reveal different trends in performance. Figure
Comparative ROC curve of the four resulting structures.
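The stratified partitioning and the area under the ROC curve can be sketched in Python as follows. This is an illustration of the procedure, not the code used in the study. It assumes class labels coded 1 (normal) and 2 (AD), with AD as the positive class, and a classifier that outputs a score where higher means more likely AD. The function names and the random seed are hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Split indices into k folds with similar class proportions in each."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for indices in by_class.values():
        rng.shuffle(indices)
        for index in indices:
            # deal each class round-robin so every fold gets its share
            folds[position % k].append(index)
            position += 1
    return folds

def roc_auc(labels, scores, positive=2):
    """Area under the ROC curve: the probability that a random positive
    case scores higher than a random negative case (ties count as 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == positive]
    neg = [s for l, s in zip(labels, scores) if l != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Each fold serves once as the test set while the other nine are used for training, so every individual receives exactly one out-of-sample score, and those scores can then be passed to `roc_auc`.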
Prediction of complex disease phenotypes from high-throughput genotype data is an emerging research goal. Gene-SNP connectivity and its association with AD can provide critical insights into the underlying mechanisms and identify SNPs that may serve as effective targets for therapeutic intervention. Here we have introduced a framework for the use of four different Bayesian network methods on whole genome sequencing datasets to establish causal relationships among genes and between genes and Alzheimer’s disease.
In conclusion, we identified several significant polymorphisms associated with AD in the APOE, CR1, CD33, CLU, PICALM, and ABCA7 genes. Some of them were previously identified, whereas others are novel biomarkers. These results demonstrate the effectiveness of using BNs for identifying AD causal SNPs with acceptable accuracy. We hope that our work will facilitate reliable identification of SNPs that are involved in the etiology of Alzheimer’s disease, ultimately supporting timely identification of genomic disease biomarkers and the development of personalized medicine approaches and targeted drug discoveries.
The authors declare that there is no conflict of interests regarding the publication of this paper.
Data collection and sharing of this project were funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense Award no. W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging and the National Institute of Biomedical Imaging and Bioengineering and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd. and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (