GPA-MDS: A Visualization Approach to Investigate Genetic Architecture among Phenotypes Using GWAS Results

Genome-wide association studies (GWAS) have identified tens of thousands of genetic variants associated with hundreds of phenotypes and diseases, which have provided clinical and medical benefits to patients with novel biomarkers and therapeutic targets. Recently, there has been accumulating evidence suggesting that different complex traits share a common risk basis, namely, pleiotropy. Previously, a statistical method, namely, GPA (Genetic analysis incorporating Pleiotropy and Annotation), was developed to improve identification of risk variants and to investigate pleiotropic structure through a joint analysis of multiple GWAS datasets. While GPA provides a statistically rigorous framework to evaluate pleiotropy between phenotypes, it is still not trivial to investigate genetic relationships among a large number of phenotypes using the GPA framework. In order to address this challenge, in this paper, we propose a novel approach, GPA-MDS, to visualize genetic relationships among phenotypes using the GPA algorithm and multidimensional scaling (MDS). This tool will help researchers to investigate common etiology among diseases, which can potentially lead to development of common treatments across diseases. We evaluate the proposed GPA-MDS framework using a simulation study and apply it to jointly analyze GWAS datasets examining 18 unique phenotypes, which helps reveal the shared genetic architecture of these phenotypes.


Introduction
Genome-wide association studies (GWAS) have been conducted to study the genetic basis of complex human traits. As of August 2015, more than 15,000 single nucleotide polymorphisms (SNPs) have been reported to be significantly associated with at least one complex trait (the NHGRI-EBI catalog of published GWAS [1], https://www.ebi.ac.uk/gwas/). "Pleiotropy," that is, the sharing of genetic factors among complex traits, is well documented and a systematic analysis of the GWAS catalog of published GWAS showed that 17% of the reported genes are associated with multiple traits [2]. For example, genetic studies for five psychiatric disorders suggested a very strong genetic correlation between schizophrenia and bipolar disorder [3,4]. Pleiotropy has also been demonstrated among several other types of traits, such as cancers [5].
In order to leverage pleiotropy between complex traits and effectively integrate multiple GWAS datasets, Chung et al. [6] developed a unified statistical framework, named GPA (Genetic analysis incorporating Pleiotropy and Annotation), which provides statistically rigorous and biologically interpretable inference tools for genetic studies. Application of GPA to five psychiatric disorder GWAS datasets from the Psychiatric Genomics Consortium [3,4] showed that GPA can accurately identify pleiotropic structure among these diseases [6]. While the GPA framework provides a statistically rigorous framework to evaluate pleiotropy, it still remains limited to a small number of phenotypes and it is common to consider a joint analysis of only two phenotypes mainly  Figure 1: GPA-MDS framework. In this framework, by taking the association -value for each SNP from each GWAS study as an input, we first calculate a distance matrix between phenotypes using the pleiotropy hypothesis testing procedures in the GPA framework. Then, we generate a plot depicting a global picture of genetic relationship among phenotypes by projecting phenotypes onto two-dimensional space using the multidimensional scaling (MDS) algorithm based on this distance matrix.
due to computational efficiency and estimation stability and robustness. In practice, we are interested in jointly studying larger numbers of phenotypes; however it is not a trivial task to investigate and integrate results from multiple pairs of phenotypes.
In order to address this challenge, in this paper, we propose a novel visualization approach, GPA-MDS, to investigate genetic architecture with a joint analysis of multiple GWAS datasets using the GPA algorithm and the multidimensional scaling (MDS) approach. Specifically, the GPA algorithm allows for evaluation of pleiotropy between two phenotypes within a statistically rigorous framework. Then, the MDS approach effectively integrates these results for a large number of phenotypes and provides a two-dimensional map of genetic architecture. This paper is organized as follows. In Section 2, we review the GPA and MDS algorithms and propose the GPA-MDS approach. In Section 3, we evaluate the proposed method with a simulation study and apply it to a joint analysis of 18 GWAS datasets. Finally, in Section 4, we will discuss future research directions. Figure 1 shows a diagram of overall workflow for the GPA-MDS framework. Here, by taking association -value for each SNP from each GWAS dataset as an input, we first calculate a distance matrix between phenotypes using the GPA framework, as described in detail in Section 2.1. Then, we generate a plot of genetic relationship among phenotypes using the multidimensional scaling (MDS) algorithm, as illustrated in Section 2.2.

Statistical Inference of Pleiotropy Using the GPA Algorithm.
In this section, we review the GPA framework [6] for the case of a joint analysis of two GWAS datasets. Although there is no limitation in the number of GWAS datasets that can be jointly analyzed in the GPA framework, a joint analysis of two GWAS datasets is often appropriate in the sense of the computational efficiency and estimation stability and robustness. Let be the index for SNPs and let be the index for GWAS datasets. Suppose that we have performed hypothesis testing of genome-wide SNPs for two GWAS and obtained their -values. Specifically, for GWAS 1 , we have Null hypothesis for GWAS 1 : (11) 0 , (21) 0 , . . . , ( 1) 0 , . . . , where is the number of SNPs and denotes -value of the th SNP in the th GWAS. Similarly, for GWAS 2 , we have Null hypothesis for GWAS 2 : (12) 0 (2) Let us denote P 1 = ( 11 , 21 , . . . , 1 , . . . , 1 ) and P 2 = ( 12 , 22 , . . . , 2 , . . . , 2 ). We introduce latent variables Z = [ 00 , 10 , 01 , 11 ] indicating the association between the th SNP and the two phenotypes: 00 = 1 means that the th SNP is not associated with any phenotypes, 10 = 1 means that it is only associated with the first one, 01 = 1 means that it is only associated with the second one, and 11 = 1 means that it is associated with both. We assume that 00 , 10 , 01 , 11 ∈ {0, 1} and 00 + 10 + 01 + 11 = 1 because a SNP can only be International Journal of Genomics 3 one of these states. Given these latent variables, we assume the following emission distributions: 00 = Pr ( 00 = 1) : 10 = Pr ( 10 = 1) : 11 = Pr ( 11 = 1) : where 0 < < 1, = 1,2. We put the constraint 0 < < 1 to model that a smaller -value is more likely than a larger -value when it is from the nonnull group [7]. Parameters in the GPA model can be estimated using the Expectation-Maximization (EM) algorithm [8], which is remarkably computationally efficient because we have explicit solutions for all the parameters in the -step. More details about the GPA model, the EM algorithm, and the estimation of standard errors can be found in [6]. Given the GPA model and its estimated parameters, we can determine association of th SNP with phenotypes based on their local false discovery rate (FDR) [9]. Specifically, the local FDR for association of th SNP with each of the first and second phenotypes is defined as Similarly, the local FDR for association of th SNP with both phenotypes is defined as Then, we use the direct posterior approach [10] to control the global FDR. Specifically, we first sort SNPs by their local FDR from the smallest one to the largest one. Denote these sorted local FDR for th SNP by . We increase the threshold for local FDR, , from zero to one until where is the predetermined bound of global FDR and 1{⋅} is an indicator function with value of one if the statement is true and of zero otherwise. Finally, we determine SNPs with corresponding ≤ to be associated with the phenotype. Now, consider testing pleiotropy between two phenotypes. When the genetic bases of the two phenotypes are independent of each other (i.e., no pleiotropy), then we expect that 11 = ( 10 + 11 )( 01 + 11 ). Therefore, the difference between 11 and ( 10 + 11 )( 01 + 11 ) can be used to characterize pleiotropy. Hence, testing pleiotropy can be formulated with the following hypothesis: where 1 * = 10 + 11 and * 1 = 01 + 11 . The likelihood ratio test (LRT) statistic can be constructed as follows: whereΘ 0 represents the parameter estimates obtained under the null hypothesis of pleiotropy test. The test statistic (−2 log ) asymptotically follows 2 -distribution with degree of freedom of one, under the null hypothesis. Fitting of the GPA model and hypothesis testing of pleiotropy were implemented as a part of the R package "GPA," which is currently available in its GitHub page (http://dongjunchung.github.io/ GPA/).

Visualization of Pleiotropic Structure Using Multidimensional Scaling.
For the visualization of genetic relationships among phenotypes, we first run pleiotropy tests for all possible pairs of GWAS datasets and generate a matrix of their log 10 -transformed -values, denoted as . Then, we define a distance between th and th phenotypes as = − 2 min < { }. Note that this definition of distance assigns a shorter distance to a pair of phenotypes with smaller pleiotropy test -values, while it also allows avoiding negative distance values. Then, we feed this distance matrix to the MDS algorithm and project phenotypes onto the two-dimensional space. As a result, MDS essentially clusters phenotypes based on their genetic similarities; that is, phenotypes sharing fewer SNPs are located further apart on the two-dimensional space compared to phenotypes sharing more SNPs. Specifically, given a distance matrix = ( ), MDS seeks to find 1 , . . . , ∈ such that by minimizing the following objective function: Here, we consider = 2 to provide easily understandable visualization. We used the function cmdscale() in R with default settings to implement MDS.

Simulation Study.
We conducted a simulation study to evaluate the performance of GPA-MDS approach. Here, we assume that there are five GWAS datasets, each of which is profiled for a set of 10,000 SNPs common to all 5 datasets. Among these 10,000 SNPs, 2,000 SNPs (20%) were assumed to be risk SNPs for each phenotype. In order to generate pleiotropic structure, we set 75% of the risk SNPs to be shared between phenotypes 1 and 2 and also between phenotypes 3 and 4, while no risk SNPs are shared between phenotypes 1/2 and phenotypes 3/4. Phenotype 5 did not share any risk SNPs with any other phenotypes as a negative control 4 International Journal of Genomics Each box indicates a GWAS study and -axis represents the SNP index. The gray box within each box indicates risk SNPs. In this study, we considered five phenotypes and 20% of the SNPs were assumed to be risk SNPs for each phenotype. We further assumed that 75% of risk SNPs were shared between phenotypes 1 and 2 and also between phenotypes 3 and 4 to generate pleiotropic structure. Phenotype 5 did not share any risk SNPs with any other phenotypes as a negative control.  ( Figure 2). Finally, for each phenotype, we simulated -values for nonrisk SNPs from a uniform distribution and -values for risk SNPs from a Beta distribution with parameters 0.4 and 1. Figure 3 shows the GPA-MDS plot for these five phenotypes. In this plot, phenotypes 1 and 2 are clustered and phenotypes 3 and 4 generate another cluster. Phenotype 5 is isolated and located away from these two phenotype clusters. This result shows that the proposed GPA-MDS approach can provide easily interpretable visualization revealing the pleiotropic architecture among phenotypes.

Real Data Analysis.
We applied the proposed GPA-MDS approach to the GWAS datasets of 18 phenotypes, using summary statistics, which are publicly available from consortium websites. Specifically, we considered (1) attention deficit/hyperactivity disorder (ADHD), autism spectrum disorder (ASD), bipolar disorder (BPD), major depressive disorder (MDD), and schizophrenia (SCZ) from the Psychiatric Genomics Consortium (http://www.med.unc.edu/ pgc); (2) Crohn's disease (CD) and ulcerative colitis (UC) from the International Inflammatory Bowel Disease Genetics Consortium (https://www.ibdgenetics.org/); (3) rheumatoid arthritis (RA) (https://www.broadinstitute.org/ftp/pub/rheumatoid arthritis/Stahl etal 2010NG/); (4) high-density lipoprotein (HDL), low-density lipoprotein (HDL), triglycerides (TG), and total cholesterol (TC) from the Global Lipids Consortium (http://csg.sph.umich.edu//abecasis/public/lip-ids2010/); (5) type 2 diabetes (T2D) from the DIAbetes Genetics Replication And Meta-analysis Consortium (http:// diagram-consortium.org/); (6) coronary artery disease (CAD) from the CARDIoGRAM Consortium (http://www .cardiogramplusc4d.org/data-downloads/); (7) systolic blood pressure (SBP) and diastolic blood pressure (DBP) from the International Consortium for Blood Pressure (http://www .georgehretlab.org/icbp 088023401234-9812599.html); and (8) fasting glucose (FG) and log of fasting insulin (LFI) from the MAGIC Consortium. We used the intersection of SNPs among these datasets, which consists of 228,944 SNPs. Figure 4 shows the GPA-MDS plot for the 18 phenotypes. We can see that clinically related phenotypes are tightly International Journal of Genomics 5   ADHD  ASD  BPD  MDD  SCZ  CD  UC  RA  HDL  LDL  TG  TC  T2D  CAD  SBP  DBP  FG  LFI   ADHD  ASD  BPD  MDD  SCZ  CD  UC  RA  HDL  LDL  TG  TC  T2D  CAD  clustered in this plot. For example, all the neuropsychiatric disorders (ADHD, ASD, BPD, MDD, and SCZ) generate a cluster, inflammatory bowel diseases (UC and CD) make a cluster, lipid-related phenotypes (HDL, LDL, TC, and TG) cluster together, blood pressure phenotypes (SBP and DBP) cluster, and so on. Moreover, RA is also located relatively close to UC and CD, which is consistent with the literature as RA, UC, and CD are all autoimmune diseases [11]. The cluster containing both T2D and CAD in this plot is also well supported by prior studies, which suggest the pleiotropy between T2D and CAD [12][13][14]. These results show the potential of the proposed GPA-MDS approach for the investigation of pleiotropic architecture, which can be used to promote understanding of common etiology and development of joint treatment of diseases. In order to further understand the phenotype mapping provided by GPA-MDS, we checked the number of risk SNPs shared among phenotypes ( Figure 5). Here, "risk SNPs" were determined using the GPA algorithm by controlling the global FDR at 0.1. We can see that some of the phenotypes that are closely located in the GPA-MDS plot actually share more risk SNPs, as in the case of CD-UC, HDL-LDL-TG-TC, and SBP-DBP. However, it might look like that Figure 5 seems to contradict the GPA-MDS plot for other phenotypes. For example, although ADHD and ASD are located really close to each other in the GPA-MDS plot, it seems that only a few risk SNPs are shared between these two phenotypes. Such "discrepancy" happens because GPA evaluates pleiotropy by checking whether the number of shared risk SNPs ( 11 ) is significantly higher than what is expected by chance (( 10 + 11 )( 01 + 11 )), not simply based on the counts of shared risk SNPs ( 11 ).  In order to confirm this explanation and further evaluate the utility of employing the GPA algorithm, we generated a MDS plot of phenotypes, where the distance between two phenotypes was determined solely by the number of shared risk SNPs ( Figure 6). Specifically, we generated the MDS plot by defining the distance between th and th phenotypes as = 2 max < { } − , where is the log 10 -transformed risk SNPs shared between th and th phenotypes. In this mapping of phenotypes, clinically related phenotypes failed to cluster together but instead multiple phenotype groups are mixed together. Furthermore, we can see that the mapping is essentially driven by a few pairs of phenotypes which share large numbers of genotypes, such as CD-UC, TC-LDL, and SBP-DBP. Hence, we can conclude that it was actually critical to utilize the GPA algorithm to evaluate pleiotropy in the first step of our visualization framework because it provides biologically more meaningful visualization of genetic relationship among phenotypes.

Conclusion
In this paper, we proposed a novel visualization approach for genetic architecture among phenotypes, namely, GPA-MDS, using the GPA algorithm and the multidimensional scaling (MDS) approach. While the GPA framework provides a rigorous evaluation of pleiotropy between a pair of phenotypes, the MDS approach extends this investigation to larger number of phenotypes in a computationally efficient way. The application of GPA-MDS to the genetic studies of 18 phenotypes revealed patterns of shared genetic architecture among phenotypes, underscoring the potential of 6 International Journal of Genomics the proposed method to investigate genetic sharing among complex traits. We note that when the proposed GPA-MDS framework is used, it is critical to confirm that its assumptions hold well for the input GWAS dataset. Specifically, because GPA-MDS uses GPA as its first step, users need to confirm that GWAS association -values provided to GPA-MDS satisfy the GPA assumptions, for example, uniformity of null -values. For example, if population stratification and cryptic relatedness have not been accounted for in previous GWAS studies, -values of null SNPs may not follow the assumed uniform distribution and can result in biased MDS visualization results. Hence, these confounding effects should be checked carefully and addressed before applying the GPA-MDS framework to these GWAS association -values. We recommend readers to check [6] for deeper discussion of the GPA assumptions.
Currently, we are working on the following directions that can further improve the GPA-MDS framework. First, in this paper, we used a definition of distance based on the logarithm transformation of -values. While this approach is intuitive and works well in practice, other choices of distance measures might change visualization results. Hence, it is of great interest to investigate other choices of distance measures and their impacts on visualization results. Second, while the proposed GPA-MDS approach promotes global understanding of genetic architecture, it is still laborious to pinpoint risk SNPs leading to phenotype clusters when we investigate a large number of phenotypes. Hence, it would be desirable to automate the procedure to identify overlapping risk SNPs. We expect that GPA-MDS will be a useful method in elucidating the pleiotropic architecture of complex traits, which can contribute to a better understanding of shared genetic mechanisms and the development of improved diagnosis and therapeutics.