Candidate Gene Discovery Procedure after Follow-Up Confirmatory Analyses of Candidate Regions of Interests for Alzheimer’s Disease in the NIMH Sibling Dataset

The objective of this research was to develop a procedure to identify candidate genes under linkage peaks confirmed in a follow-up of candidate regions of interests (CRIs) identified in our original genome scan in the NIMH Alzheimer’s diseases (AD) Initiative families (Blacker et al. [1]). There were six CRIs identified that met the threshold of multipoint lod score (MLS) of ≥ 2.0 from the original scan. The most significant peak (MLS = 7.7) was at 19q13, which was attributed to APOE. The remaining CRIs with ‘suggestive’ evidence for linkage were identified at 9q22, 6q27, 14q22, 11q25, and 3p26. We have followed up and narrowed the 9q22 CRI signal using simple tandem repeat (STR) markers (Perry et al. [2]). In this confirmatory project, we have followed up the 6q27, 14q22, 11q25, and 3p26 CRIs with a total of 24 additional flanking STRs, reducing the mean interval marker distance (MID) in each CRI, and substantially increase in the information content (IC). The linkage signals at 6q27, 14q22 and 11q25 remain ‘suggestive’, indicating that these CRIs are promising and worthy of detailed fine mapping and assessment of candidate genes associated with AD. We have developed a bioinformatics approach for identifying candidate genes in these confirmed regions based on the Gene Ontology terms that are annotated and enriched among the systematic meta-analyzed genes, confirmed by at least three case-control samples, and cataloged in the “AlzGene database” as potential Alzheimer disease susceptibility genes (http://www.alzgene.org).


Introduction
The most common form of dementia among aging people is Alzheimer's disease (AD), which involves years, but up to 22.2% of the population over 90 years of age [5]. The number of sufferers worldwide is estimated as over 20 million [6] with more than 5 million affected in the United States.
AD is clinically subdivided into early (< 65 years) and late ( 65 years) onset forms. About 10% of AD cases are familial with an autosomal dominant inheritance and these cases are often the early onset form [7]. Mutations in three genes, amyloid precursor protein (APP) on chromosome 21, presenilin 1 (PS1) on chromosome 14, and presenilin 2 (PS2) on chromosome 1 are estimated to account for about 50% of earlyonset AD [8][9][10]. However, the majority of cases (90-95%) are late-onset AD (LOAD) that can show familial clustering without a clear Mendelian mode of inheritance [11]. Apolipoprotein E (APOE) on chromosome 19q13 has been confirmed by multiple independent studies [12] as a risk factor for LOAD, and is associated with lowering the age of onset [1]. However, APOE explains only 20-29% of the risk [13], and it is neither essential nor sufficient to cause AD [14][15][16]. The etiology of LOAD is complex with the possible involvement of several genes and/or environmental factors [17].
Efforts to identify additional LOAD loci have largely taken two main approaches: genome-wide linkage scans [1,[18][19][20][21] and association studies of polymorphisms in candidate genes (for review see [22]). These studies indicate the existence of additional AD susceptibility genes on several chromosomes. In our genome-wide linkage scan of the National Institute of Mental Health Alzheimer's Disease Genetics Initiative (NIMH-ADGI) cohort of affected siblings, we identified five CRIs (9q22, 6q27, 14q22, 11q25, and 3p26) with suggestive linkage and one CRI on 19q23 that met criteria for 'significant' evidence of linkage defined by multipoint LOD scores (MLS) 2.0 [1]. Other linkage studies, though considerable overlap between samples from NIMH-ADGI sets [14], have identified overlapping regions or regions that are within a modest distance of three of these CRI on 19q23, 9q22, and 11q25 [15][16][17]. The 14q23 region was identified by an independent scan that used only Caribbean Hispanic samples [23], and serves as a separate replicate.
We followed up the CRI signal on chromosome 9q22 by genotyping additional simple tandem repeats (STRs) and found the region remained significant with an increase in the peak MLS from 2.9 to 3.8 at 95 cM and narrowing of the CRI to 11 cM (92-103 cM), thus supporting the region as potentially harboring LOAD genes [2]. In the present study, with the aim of confirm-ing our previous whole-genome scan findings, we have conducted a follow-up study with denser STR markers spanning the four remaining CRIs (3p26, 6q27, 11q25, and 14q22). Based on the Gene Ontology (GO) terms that are annotated and enriched among the systematic meta-analyzed genes, confirmed by at least three case-control samples, and cataloged in the "AlzGene database" as potential Alzheimer disease susceptibility genes, we developed bioinformatics tools that extract potential AD candidate genes in these CRIs from genomic databases.

Study population: NIMH AD genetic initiative families
The study subjects were collected as part of the NIMH Genetics Initiative following a standardized protocol utilizing the NINCDS-ADRDA criteria for diagnosis of definite, probable, and possible AD [24,25]. A total of 468 families were ascertained. The primary structures of these families were affected sibpairs [1], and the ethnic make-up was primarily Caucasian (95%). In 437 of these families, the mean age of onset (MAO) of affected family members was above 50 (mean = 72.4, range = 50-97) (Those with MAO 50 were believed to be enriched for the APP, PS1, and PS2 mutations, and were therefore dropped from the current analyses). In addition to the total set (TS, n = 437 families), the linkage and mapping results presented here are also from a late age at onset subset with a MAO 65 (LOAD families) identified in 320 families and a subset of families identified as early-mixed (EM) with MAO between ages 51 and 65 (n = 117), both with similar primary structure and ethnic make-up as the total set of families. Blood was collected and sent to the NIMH repository at Rutgers University where genomic DNA was extracted from lymphocyte cell lines.

STR genotyping
STRs flanking the 6q27, 14q22, 11q25, 3p26 peaks were chosen from the genomic database (www.gdb.org) and deCODE Genetics (www.nature.com/ng/journal/ v31/n3/ extref/ng917-S13.xls). The genetic locations of the STRs were based on sex average distances in the Rutgers combined linkage-physical map, Build 36.1 (compgen.rutgers.edu/mapomat). A total of 24 STRs were genotyped in an attempt to decrease the inter- marker genomic distance in these four peak regions in these LOAD families, which was 9 cM in the original scan. The STRs genotyped for the original and follow-up mapping are listed in Table 1.

Linkage analysis
The statistical analyses have been detailed previously [2]. In brief, model-free linkage analysis was performed using the program Genehunter Plus with extensions to calculate the Kong and Cox statistic [27,28]. Maximum likelihood estimates of allele frequencies were calculated with the SAGE [29] program FREQ taking into account the family relationships. Replicates were performed on selected samples and any Mendelian errors were detected with the SAGE program MARK-ERINFO, as well as detected implicitly by the analytical programs used here.

Bioinformatics approach for candidate gene selection in CRIs
Once a linkage peak has been identified, hundreds of genes under the peak can be accessed using the UC-SC genome browser (http://genome.ucsc.edu). But the lists are too large to conduct expensive molecular lab work. We developed a bioinformatics approach using a Python (http://www.python.org) script to assist in the efficient automatic extraction of candidate genes in regions of linkage (the codes are available upon request). Specific procedures for AD are: 1. The region to be analyzed needs to be defined. The input regions are usually defined as the 1 LOD drop of the linkage peak whose physical location is defined by the closest markers to the ends of the regions. 2. A list of disease specific keywords generated from 24 meta-analyzed and confirmed AD genes (http://www.alzgene.org; accessed on July 2, 2007) is provided to the program. These lists are function and process key terms deposited in the Gene Ontology (GO) databases (http://www.geneontology.org; accessed on July 2, 2007) for this disease. GO is developed to capture the activities at cellular and molecular level. For AD, these terms are, in general, related to neurological function, inflammation, oxidative damage, cholesterol/lipoprotein function and atherosclerosis/vascular pathways [30]. As of 2006, there are about 18,455 GO terms assigned to proteins to illustrate what they do, where they function, and what processes they are involved in. The AlzGene (http://www.alzgene.org) with in the Alzforum (http://www.alzforum.org) cat-alogue, "AlzGene database", is a systematic, metaanalyses of potential Alzheimer disease susceptibility genes in at least three case-control samples [22]. This gene database is expected to provide a powerful tool for deciphering the genetics of Alzheimer disease, and serve as a potential model for tracking the most viable candidate genes.
Based on the GO vocabulary terms of the 24 most significant genes identified by Algene, we developed a bioinformatics approach to identify candidate genes in our QTL regions linked to AD. Briefly, we articulate that these GO terms for these 24 candidate liability genes ( Table 2) can serve as a model in discovering new AD genes in each of our linkage peak regions. For each gene, we downloaded two categories of GO terms, function and process, from GO databases and interrogated the linkage regions of interest with theses GO terms for possible candidate AD genes. Genes enriched with the specified AD related GO terms are then downloaded directly from the human genome draft Build 36.1 database (http://genome.ucsc.edu) assembly.

CRI located at 6q27
The chromosome 6q27 interval (140-192 cM) had a peak MLS score of 2.2. We have genotyped an additional ten markers and decreased the MID in this region from 10.9 cM to 3.8 cM. Additionally, the interval of support in the follow-up scan narrows to 148-188 cM ( Table 1). Graph of the MLS scores from the total dataset showed the extra marker information split the peak: one signal located between 140 and 165 cM in the LOAD group with a peak MLS of 0.95 at 153 cM and a second CRI located between 170 and 192 cM in the EM group with a peak MLS of 1.48 at 183 cM ( Fig. 1a).

CRI located at 14q22
The CRI we identified at 14q22 (37-80 cM) in the original scan had the highest MLS in the EM subset of families (MLS = 2.2). This evidence is probably due in large part to the presence of the PSEN1 gene located within this region at 68.7 cM. We have genotyped an additional 7 markers in this region, reducing the MID from 6.7 to 3.0 cM, and increasing the IC from 0.30-0.54 to 0.44-0.65. Additionally, the interval of support in the follow-up scan narrows to 50-85 cM ( Table 1).
The CRI is confirmed in the EM group between 35 and 75 cM with an increase in the peak MLS score from 2.2 to 2.5 at 48 cM (Fig. 1b). This places the peak ∼20 cM proximal to PSEN1, which suggests the location of a possible second susceptibility gene in the EM group.

CRI located at 11q27
Four additional microsatellites were genotyped in the CRI located at 11q25 (116-161 cM), resulting in a decrease in the MID of this region to 5.0 cM from an MID of 9.0 cM, and an increase of IC from 0.39-0.52 to 0.45-0.62. Additionally, the interval of support in the follow-up scan narrows to 120-153 cM ( Table 1). The peak MLS remained unchanged from the original scan (2.0 at 158 cM) for the total set and there was narrowing of the signal to a region between 130 and 164 cM (Fig. 1c).

CRI located at 3p26
The CRI located at 3p26 had an MLS of 2.0 in the original scan in the EM and CM sets. We have genotyped three additional markers in this region, that result an increase in IC from 0.32-0.52 to 0.43-0.65 and a decrease in the MID from 8.8 to 5.3 cM. The peak MLS for the EM set and the combined set decreased to 1.45 and there was no narrowing of the region. Since we wish to maintain our stringent criteria of an MLS 2.0 for confirmation of a CRI and because there is no supporting evidence of the 3p26 region linked to AD in other genomic scans, this signal is most likely a false positive, therefore we have omitted this region from any further investigations.
overlapped. About 83 genes and nearly 8,321 HapMap Caucasian SNPs were recorded at the 11q22 QTL peak interval (116-161 cM) and our bioinformatics procedure reduced the candidate gene lists to 12 genes based on function and 26 based on process, with about 6 genes overlapping.

Discussion
Follow-up linkage analyses utilizing an additional 24 STRs with an average intermarker distance of ∼5 cM, confirmed suggestive linkages in the NIMH families on the CRIs located at 6q27, 14q22, and 11q25. The MLS scores for each CRI increased from the original scan and narrowing of the CRIs is also observed. Because of the likelihood of genetic heterogeneity in this complex disease, the heterogeneity LOD (HLOD) score statistic [31], which allows for linked and unlinked families in the sample, was performed on the original CIDR marker set and the follow-up set of markers in each region. In addition, the information content (IC), calculated in Genehunter Plus, for the CRI at 6q, 14q and 11q increased from a range of 0.30-0.54, 0.39-0.52, 0.32-0.52 in the original scan to 0.44-0.65, 0.45-0.62, 0.43-0.65 in the follow up, respectively. An IC estimate close to 0.7 is the theoretical maximum for sib pair families, which is the predominant structure in the NIMH families [32]. It is also worthwhile to mention that the chr. 14 'peak' is located at a good distance away from the most obvious candidate gene in the region, i.e. PSEN1. Hence, the above data and analyses suggest these regions may harbor additional loci impacting on risk to AD.
The reduced evidence for linkage at 3p26, initially identified as suggestive on the original genomic scan, may be attributed to confounding factors such as genetic or clinical heterogeneity of the disease, difficulty to detect genes of small effect, which reflects the difficulties and challenges that face the genetic mapping of complex traits like AD, or it may be a true false positive.

Classical gene discovery approach
In general, gene discovery for complex traits follows four strategic steps: whole genome linkage or association studies to identify chromosome candidate regions, fine-mapping by linkage or association studies of polymorphisms (Ex. SNPs) to identify the putative causal gene(s), sequence analysis of the pinpointed gene (s) to identify causal variant(s), and finally functional tests of the found variants [33]. However, it is becoming apparent that many, and perhaps most of the regions that show linkage to a phenotype in multiple populations harbor more than one susceptibility locus [34]. For example, different asthma-related phenotypes are known to map to different locations within a broad linkage region [35]. Moreover, it is even likely that additional susceptibility loci reside within the same linkage peak and close to some of the positionally cloned genes, as has been shown for asthma [36]. Therefore, we propose that such regions are more likely to harbor multiple susceptibility loci, each with relatively small to moderate effects rather than large effects, on disease risk.

Limitations of the classical approach
The challenge in following-up linkage signals is to identify the genes that are responsible for the observed linkage results contained in such broad chromosomal regions (30 to 40 cM of recombination or 30-40 millions of DNA bases in length). Fine mapping of the linked region using more closely spaced STRs is limited by the low frequency of recombination events between any two closely spaced points in the genome and the limited number of highly informative STRs available, resulting in only marginal increases in IC.
If a CRI has been confirmed and narrowed with follow-up mapping, the region still may be relatively broad, perhaps 20-30 cM, such that potentially hundreds of genes may be contained within a single CRI. Traditionally, the 1-LOD-drop support interval flanking the peak MLS is used as a guide to define the critical region of an observed linkage peak [31], which could further narrow the region from 5-15 cM. However, a large number of genes may still be located in this narrowed region. Given current technology, an exhaustive analyses of all these genes is possible, however given current resources, comprehensive evaluation of all of the genes in a region of linkage is rarely possible, being both laborious and expensive. For example, the CRIs on chromosomes 6q27, 14q22, and 11q22 that we confirmed and narrowed to intervals of 52 cM, 40 cM, and 31 cM, respectively, still contain 159, 198, and 83 genes, respectively, according to the latest assembly (March 2006, NCBI build 36.1) of the human genome draft. Thus, a bioinformatics approach for identification and prioritization of candidate genes from the number of genes in confirmed CRIs remains a critical step following this classical approach. New approaches such as the current high-density genome-wide association (GWA) arrays also requires deep-sequencing, and if there are several SNPs in LD, then bioinformatics will help prioritize the genes to start deep sequencing on. GWA should be also considered as an initial step in the elucidation of susceptibility variants. There may be several regions that may harbor genetic variants that influence susceptibility to a specific disease. Several of the associated SNPs may fall within and near genes. Fine-mapping studies of these several regions are needed to confidently localize the signals to specific genes. Confirmation of the role of genes in the identified regions of interest will require prioritization of the genes and replication studies in other populations, and, ultimately, functional studies. More over, genome-wide association studies are promising, yet not always economically feasible or statistically desirable [37]. Therefore, one of the greatest challenges in disease association study design remains the intelligent selection of candidate genes.

Selection of candidate genes within CRI
There are potentially hundreds of potential candidate genes which could be investigated in association studies, family-based or case-control, using SNPs and programs such as LAMP [38] can be used to determine whether the entire linkage signal can be explained by a SNP. However, which candidate gene(s) to perform SNP genotyping for association testing is a conundrum. Hence, we must rely on the selection of positional candidate genes in the linkage region that have a known biological function related to our trait or are homologous to other genes in our phenotypic causal pathways. Combining mapping and arraying has been suggested as one approach to reduce the number of genes from QTL regions [39].
In the past few years, several groups have published bioinformatics methods for narrowing the lists of candidate genes (see the review by Tiffin et al. [34]. The fundamental assumption of the scheme for prioritizing candidate genes from linkage studies is that genes involved in or predisposing to a given polygenetic disease tend to share more commonalities in their molecular function, biological process or physiological pathway [40,41] than genes chosen at random or genes not involved in the same disease. Hence, Gene Ontology annotation terms will be enriched among genes linked to the trait [42] and such commonalities are often sufficient to identify these genes from regions containing hundreds of other genes. Based on these assumptions, one way to narrow the list of candidate genes in AD is to search for annotation terms that are enriched among the systematic metaanalyzed AD candidate genes, that have been confirmed by at least three case-control samples and cataloged in the "AlzGene database" as potential Alzheimer disease susceptibility genes (http://www.alzgene.org) relative to randomly sampled AD genes. Using these more biologically plausible GO function and process terms, we were able to reduce the number of positional candidate genes within each critical region (within the 1-LOD intervals) defined above. The list of genes can be initially prioritized based upon the number of GO terms assigned to each gene. Nevertheless, a large number of genes in the human genome are uncharacterized or poorly characterized, and the annotation of genes in Gene Ontology (GO) databases are incomplete and biased toward highly studied genes; thus, novel or poorly characterized genes could be missed. Moreover, GO is one approach and component to a more comprehensive systems biology approach. Additional sources of information such as gene expression, protein-protein interaction networks, tissue specificity, KEGG or BioCarta pathways, and sequence homology, could be integrated into a combined statistic, and biological confirmation of the candidate genes should be performed. Although the approach here still suffers from the bias introduced by pathways selected by biased researchers [30,43,44], this bias will be reduced once more genome wide association studies are incorporated into the AlzGene database, and gene annotation becomes more comprehensive. Finally, each investigator can then develop bioinformatics methods for further narrowing or prioritizing the list of candidate genes using a variety of selection variables such as expression profiles in neuronal tissue, the number of GO terms for each gene, evolutionary conservation, and patterns of gene duplication [33], and other systems biology approaches to determine which to interrogate first.

Conclusion
In 2003, we reported the results of a whole genome AD linkage study in the NIMH AD Genetics Initiative families [1]. In this study, we performed a follow up linkage study by genotyping 24 additional STR markers in the same sample of 437 families. The chromosome 6q27, 14q22, 11q27 and 3p26 CRIs were statistically analyzed under the same dominant inheritance model, resulting in confirmatory evidence for the CRIs at 6q27, 14q22, and 11q27. We were able to reduce the total number of genes in these regions to a list of plausible candidate genes using function and process GO terms derived from 24 meta-analyzed and confirmed AD genes on the Alzgene website. The approach, using bioinofrmatics tool and databases like Alzgene will facilitate in the identification and prioritization of candidate genes after the classical follow-up of promising linkage region and for deep-sequencing of several SNPs that are in LD in the current high-density genome-wide association array platforms.

Outlook of the experiment
Due to the evidence of linkage and the consistency of signals, we will be pursuing chromosome regions 6q27, 14q22, and 11q25 for possible candidate genes. We plan to first, undertake detailed fine mapping of these CRIs using dense SNPs selected from HapMap to narrow down the CRIs through linkage disequilibrium (LD) mapping. We will include in this phase the interrogation of candidate genes using tag SNPs from HapMap. We plan to priortize these genes based on the number of GO terms for each, and expression in neuronal tissue and determine which to interrogate first. Selection of SNPs in candidate genes will be based upon the location (coding/promoter vs. non-coding SNPs), type (if coding) (nonsynonymous vs. synonymous), as well as comparison with other species [45,46] to identify highly conserved variants. Finally, sequencing within critical gene regions will need to be done in a systematic way to identify gene variants that may predispose to AD. These can then be confirmed in population-based studies, with further studies in animal models or other in vivo methods to ascertain its function in the disease process.