Genes Associated with SLE Are Targets of Recent Positive Selection

The reasons for the ethnic disparities in the prevalence of systemic lupus erythematosus (SLE) and the relatively high frequency of SLE risk alleles in the population are not fully understood. Population genetic factors such as natural selection alter allele frequencies over generations and may help explain the persistence of such common risk variants in the population and the differential risk of SLE. In order to better understand the genetic basis of SLE that might be due to natural selection, a total of 74 genomic regions with compelling evidence for association with SLE were tested for evidence of recent positive selection in the HapMap and HGDP populations, using population differentiation, allele frequency, and haplotype-based tests. Consistent signs of positive selection across different studies and statistical methods were observed at several SLE-associated loci, including the PTPN22, TNFSF4, TET3-DGUOK, TNIP1, UHRF1BP1, BLK, and ITGAM genes. This study is the first to evaluate and report that several SLE-associated regions show signs of positive natural selection. These results provide corroborating evidence in support of recent positive selection as one mechanism underlying the elevated population frequency of SLE risk loci and support future research that integrates signals of natural selection to help identify functional SLE risk alleles.


Introduction
Systemic lupus erythematosus (SLE) is an autoimmune disease whose prevalence, incidence, and severity are known to vary among ethnic groups. Increased prevalence has been reported among African-Americans, Asians, Hispanics, and Native Americans (reviewed elsewhere [1,2]). The reasons for the ethnic disparities remain elusive. According to the "hygiene hypothesis" first proposed by Strachan two decades ago [3], the increased prevalence of autoimmune and allergic diseases in industrialized countries may be due to modern society's limited pathogen exposure. The hygiene hypothesis posits that humans have adapted to infectious exposures that were the norm in the past and that such exposure was protective against autoimmune disease. Over many generations, environmental pressure may have favored alleles that allow humans to respond to immune system challenges differently but that also increase the risk of autoimmune diseases. This could be a mechanism explaining the number of SLE risk alleles that are common in the population.
Human genome variation at the population level is shaped by four evolutionary processes: mutation, migration, random genetic drift, and natural selection. Natural selection is the process by which a trait, in the context of the organism's environment, becomes either more or less common in a population as a function of the effect of the inherited trait on differential reproductive success. This ability to survive, reproduce, and contribute to the gene pool of the next generation is known as fitness. Natural selection drives adaptation, the evolutionary process whereby over generations the members of a population become better suited to survive and reproduce in their environment. While negative selection decreases the prevalence of traits that diminish individuals' fitness, positive selection increases the prevalence of adaptive traits. Left untreated, SLE would have a reproductive fitness cost, where fitness is defined as the ability to raise offspring that successfully reproduce. Thus, some evolutionary process must sustain the relatively high frequency of SLE risk alleles seen in current populations around the world. We hypothesize that since the human genome is shaped by adaptation to environmental pressures at the population level, one plausible reason for the higher frequency of disease-risk alleles may be the direct effect of population-specific positive natural selection.
There is compelling evidence that natural selection is acting on a significant fraction of all genes (∼3%) [4][5][6][7] and as much as 10% of the human genome [8]. Multiple studies have identified genes involved in immune-related functions to be under selection [8][9][10], including the HLA [11][12][13][14] (associated with all autoimmune diseases), BTLA [10] (associated with rheumatoid arthritis), ITPR3 [10] (SLE, type 1 diabetes, Graves' disease), PTPN22 [10] (rheumatoid arthritis, Crohn's disease, type 1 diabetes, vitiligo), ITGAX [10] (SLE), and BLK [10] (SLE, rheumatoid arthritis, Kawasaki disease). Finally, we have recently provided evidence that variants within the APOL1 gene known to be under selective pressure in some African populations predispose to end-stage kidney disease in SLE [15]. Given the increasing evidence of selection at loci associated with human autoimmune diseases, identification of alleles under selection may provide further insight into SLE susceptibility and help explain the natural history of SLE predisposition.

Methods
A list of genetic regions with compelling evidence of association with SLE was compiled from the literature. This list includes results that met genome-wide significance in any genome-wide association study (GWAS) or transethnic study of SLE and common or rare variants that are considered established SLE-predisposing loci from candidate gene and other studies. The list of regions was based on the literature as of August 2013 and comprises 89 genes in 74 genomic regions.
This list was built upon all the SLE-associated regions described in recent reviews [16][17][18][19], which include common and rare variants from candidate gene studies with compelling evidence of association with SLE. We included all reported risk variants for SLE using data from the National Human Genome Research Institute's Catalog of Published GWAS (http://www.genome.gov/gwastudies), accessed on August 30, 2013 [20]. Finally, we searched PubMed (http://www.ncbi.nlm.nih.gov/pubmed) for all large-scale transethnic or multiracial studies in SLE and catalogued all variants with a reported meta-analysis P value < 5 × 10^−7. The references for these more recent studies are included in Table 1. Given the paucity of studies conducted in some minority populations, and in order to avoid differential bias due to the number of reported associations in different ethnic groups, we chose to include all variants regardless of the population(s) in which they have been reported to date.
Assuming no other influencing factors, the advantageous alleles at a locus under positive selective pressure will tend to increase in prevalence over generations. This can lead to allele frequency differences between populations, which can be detected using statistics that compare the genetic variability within and between populations [69]. It can also cause the haplotype carrying the advantageous allele to remain longer than genetic distance would predict for alleles of equal frequency, which can be measured using haplotype-based statistics [7]. The evidence of selection in each SLE-associated region was analyzed using population differentiation, allele frequency spectrum, and haplotype-based statistics in the HapMap II and HGDP populations as implemented in the Haplotter (http://haplotter.uchicago.edu/) [7] and the Human Genome Diversity Project (HGDP) Selection Browsers (http://hgdp.uchicago.edu/cgi-bin/gbrowse/HGDP/) [70], respectively.
Haplotter displays the results of a scan for positive selection in the human genome using the International HapMap Project data (http://haplotter.uchicago.edu/) [7]. These data consist of ∼800,000 polymorphic SNPs in three distinct population samples of unrelated individuals: 89 Japanese and Han Chinese individuals from Tokyo and Beijing, respectively, denoted as East Asian (ASN), 60 individuals of northern and western European origin (CEU), and 60 Yoruba (YRI) from Ibadan, Nigeria. It shows results on the autosomes only. Results from several selection statistics are displayed, including (1) the fixation index (FST), (2) Tajima's D, and (3) the integrated haplotype score (iHS). In situations where selection is restricted to certain populations or geographical locations, the allele frequencies at the locus that is undergoing selection may vary significantly between different populations. The fixation index FST provides a metric of the magnitude of global allele frequency differentiation between populations at a locus [69,71]. FST is directly related to the variance in allele frequency among populations and, conversely, to the degree of resemblance among individuals within populations. If FST is small, the allele frequencies in the populations being compared are similar; if it is large, the allele frequencies are different [72]. Tajima's D is based on the frequencies of the polymorphisms segregating at a locus [73]. As described [7], positive selection results in an excess of high frequency derived alleles compared to neutral expectations when the selected allele has swept to high frequencies. Positive selection also results in an excess of low frequency polymorphisms, especially when the selected allele is close to fixation or right after fixation. This skewing of SNP frequencies in different directions can be detected by Tajima's D, which is based on the frequencies of SNPs segregating in the region of interest [73].
Signals of selective sweeps will result in strongly negative values of D. The integrated haplotype score (iHS) uses the lengths of the haplotypes surrounding each core SNP to identify SNPs for which alleles have rapidly risen in frequency [7,74]. It is based on the linkage disequilibrium (LD) surrounding a positively selected allele compared with background, providing evidence of recent positive selection at a locus [7]. An iHS score > 2.0 indicates that haplotypes on the ancestral background are longer than those on the derived allelic background.
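The two frequency-based statistics described above can be sketched in a few lines. This is a minimal illustration, not the Haplotter implementation: the function names are hypothetical, Tajima's D follows the standard formula computed from per-site derived-allele counts, and the FST shown is a simple two-population, equal-weight heterozygosity estimator, (HT − HS)/HT, which is an assumption and may differ from the estimator used by the browsers.

```python
import math

def tajimas_d(derived_counts, n):
    """Tajima's D from per-site derived-allele counts in a sample of n chromosomes.

    Returns None when there are no segregating sites or n < 4
    (the variance term is zero for n = 2 or 3).
    """
    counts = [k for k in derived_counts if 0 < k < n]  # keep segregating sites only
    s = len(counts)
    if s == 0 or n < 4:
        return None
    # Mean number of pairwise differences (pi), summed over sites
    pi = sum(2.0 * k * (n - k) / (n * (n - 1)) for k in counts)
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference between pi and Watterson's estimator, scaled by its standard deviation
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))

def wright_fst(p1, p2):
    """Two-population FST from allele frequencies, assuming equal population weights."""
    p_bar = (p1 + p2) / 2.0
    h_total = 2.0 * p_bar * (1.0 - p_bar)                            # pooled heterozygosity
    h_within = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0  # mean within-population
    if h_total == 0.0:
        return 0.0  # monomorphic in both populations
    return (h_total - h_within) / h_total
```

With these definitions, a region carrying only singletons yields a negative D and a region carrying only intermediate-frequency variants yields a positive D, matching the direction of the skew described in the text; identical allele frequencies give an FST of 0 and fixed differences give 1.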
For these analyses, genome-wide SNP data from Phase II of the HapMap Project were used to investigate whether the regions associated with SLE showed evidence of selection in the CEU, YRI, and ASN populations using these three metrics (iHS, Tajima's D, and FST). Regions of 1 Mb around each of the 74 regions in Table 1 were queried, and, when higher than 2, the maximum value on the y-axis (−log(Q)) in this 1 Mb interval was recorded. As described by Voight et al. [7], the −log(Q) value represents the negative log of the rank of the observed statistic for a given SNP divided by the total number of SNPs. The statistic that is ranked is obtained independently for each of the three statistics and separately for each population. For D, the estimated value of D was used for ranking. For iHS, for each SNP, 25 SNPs on either side of the SNP are scanned for |iHS| > 2. The proportion of SNPs in this 51 SNP window with |iHS| > 2 is computed. For FST, the statistic to be ranked is obtained in a similar manner as that for iHS, except that the threshold for defining a significant FST is based on the top 5% cutoff for each population comparison. The different thresholds used for FST were CEU-YRI: 0.2976, CEU-ASN: 0.2055, and YRI-ASN: 0.3374. Haplotter also displays the FST value of the SNPs in the top 1% within each population comparison, which were also recorded if any such SNPs were present in the 1 Mb interval. In addition to these, Haplotter shows an empirical P value estimated for each gene and for each population, as detailed by Voight et al. [7]. When this P value showed significant evidence for selection, the value was recorded.
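The ranking procedure described above can be illustrated with a short sketch. The function names are hypothetical, and several details are assumptions rather than documented Haplotter behavior: the logarithm is taken as base 10, larger values of the ranked statistic are treated as more extreme, ties are ranked in input order, and windows are truncated at the ends of the SNP list.

```python
import math

def windowed_extreme_fraction(ihs, flank=25, cutoff=2.0):
    """For each SNP, the fraction of SNPs in its window with |iHS| > cutoff.

    The window is the SNP itself plus `flank` SNPs on either side
    (51 SNPs when flank=25); it is truncated at the ends of the list.
    """
    n = len(ihs)
    prefix = [0]  # prefix sums of the "extreme" indicator
    for value in ihs:
        prefix.append(prefix[-1] + (1 if abs(value) > cutoff else 0))
    fractions = []
    for i in range(n):
        lo = max(0, i - flank)
        hi = min(n, i + flank + 1)
        fractions.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return fractions

def neg_log_rank(stats):
    """-log10(Q) per SNP, where Q = rank / total and rank 1 is the largest statistic."""
    n = len(stats)
    order = sorted(range(n), key=lambda i: stats[i], reverse=True)
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = -math.log10(rank / n)
    return scores
```

For iHS, the output of `windowed_extreme_fraction` would be passed to `neg_log_rank`. For a statistic such as Tajima's D, where strongly negative values are the signal of interest, the caller would need to negate the values before ranking; that choice is also an assumption of this sketch.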
The HGDP Selection Browser displays results from a series of genome-wide scans for natural selection using single nucleotide polymorphism (SNP) genotype data from the Human Genome Diversity-CEPH Panel (HGDP), a dataset containing 938 individuals from 53 populations typed on the Illumina 650Y platform (http://hgdp.uchicago.edu/cgi-bin/gbrowse/HGDP/) [70]. Summary statistics regarding haplotype structure and population differentiation in these data can be queried in the browser. These include the iHS, the FST, and the cross-population extended haplotype homozygosity test (XP-EHH) [74]. While the iHS detects partial selective sweeps of moderate frequency (∼50%-80%), the XP-EHH detects selected alleles that have risen to near fixation in one population (above 80% frequency) [7,74]. As described by Pickrell et al. [70], the FST was calculated at the level of the population groupings identified by Rosenberg et al. [75]; that is, if a SNP has high FST, most of the variance in allele frequencies is captured by the seven labels identified in that paper. The browser plots the −log10 of the empirical P value for each SNP; the higher this plotted −log10 value, the more extreme (high) the FST value is compared with the rest of the genotyped SNPs. The iHS was calculated as in Voight et al. [7] and smoothed across windows. The test statistic was the fraction of SNPs with |iHS| > 2, and the plotted value is the −log10 of the P value for a window centered at the SNP; high values again indicate potential signals of positive selection. The XP-EHH was calculated as in Sabeti et al. [74]. The test statistic was the maximum XP-EHH. Again, the plotted value is a measure of how extreme a SNP is with regard to the rest of the genome, and high values indicate outliers potentially due to the action of natural selection.
The iHS and XP-EHH have been calculated in each individual population, as well as in the following groupings: Bantu-speaking populations, Europeans, Middle Easterners, Central Asians, East Asians, Americans, and Oceanians.
Regions of 1 Mb around each of the 74 regions in Table 1 were queried, and the maximum value on the y-axis (−log(P)) in this 1 Mb interval was recorded.
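The window query described above amounts to taking the maximum plotted score among SNPs near each region. The following is a minimal sketch under stated assumptions: the function name and the `(chromosome, position, score)` tuple layout are hypothetical, and the 1 Mb interval is assumed to be centered on a single coordinate for the region (500 kb on either side), which the text does not specify.

```python
def max_signal_in_window(snps, chrom, center, window_bp=1_000_000):
    """Maximum score among SNPs on `chrom` within a window centered at `center`.

    snps: iterable of (chromosome, position, score) tuples.
    Returns None when no SNP falls inside the window.
    """
    half = window_bp // 2
    scores = [
        score
        for snp_chrom, pos, score in snps
        if snp_chrom == chrom and center - half <= pos <= center + half
    ]
    return max(scores) if scores else None
```

In practice the same function would be called once per region, per statistic, and per population, and the returned maximum recorded in the corresponding table cell.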

Results
To test whether SLE susceptibility loci show evidence of positive selection, a list of 74 genetic regions with compelling evidence of association with SLE was compiled (Table 1), and 1 Mb regions around each of the 74 regions were queried. Regions where the maximum −log(Q) > 3 (for Haplotter) or −log(P) > 3 (for HGDP) for the FST, D, iHS, or XP-EHH were considered as showing evidence for recent positive selection (Tables 2 and 3). In addition, regions that in the HapMap populations had SNPs with FST values in the top 1% within each population comparison, or whose empirical P value estimated for each gene and for each population showed significant evidence for selection (P value < 0.001), were also considered to show evidence for selection. Of the 74 regions associated with SLE, 19 showed evidence of selection in a HapMap population (Table 2), and 16 exhibited a signal of selection in an HGDP population (Table 3). Many of these loci also had corroborating evidence from different metrics.
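The thresholding step described above can be sketched as a simple filter over the recorded per-region maxima. The function name, the data layout, and the region labels are hypothetical, and the example scores are made up for illustration; they are not results from this study.

```python
def regions_passing(region_scores, threshold=3.0):
    """Map each region to the statistics whose recorded maximum exceeds `threshold`.

    region_scores: dict of region -> dict of statistic name -> score (or None
    when no value was recorded). Regions with no passing statistic are omitted.
    """
    passing = {}
    for region, stats in region_scores.items():
        hits = sorted(
            name
            for name, score in stats.items()
            if score is not None and score > threshold
        )
        if hits:
            passing[region] = hits
    return passing

# Hypothetical values for illustration only; not data from the study.
example = {
    "REGION_A": {"iHS": 3.4, "D": None, "FST": 2.1},
    "REGION_B": {"iHS": 2.2, "D": 2.9, "FST": None},
    "REGION_C": {"iHS": 3.1, "D": 3.6, "FST": 3.0},
}
print(regions_passing(example))
```

Note that the comparison is strict, so a score of exactly 3.0 does not pass. Listing the passing statistics per region also makes it easy to see which regions are supported by more than one metric.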
In the HapMap data, multiple regions displayed evidence of population differentiation, as indicated by the FST (e.g., YRI versus ASN). The highest allele frequency differences, as indicated by the D statistic, were detected in the PTPN22, IFIH1, ITPR3, and XKR6-BLK regions. The ITPR3 region also had a high iHS. ITPR3 and BLK are the regions that displayed the most consistently strong evidence for selection according to all three metrics. The ITPR3 gene lies at 6p21, adjacent to the centromeric end of the extended MHC region, after the class II flanking region. XKR6 and BLK lie on the same chromosomal inversion at 8p23.1. PTPN22, ITPR3, and CD226 exhibited the strongest evidence for selection according to the frequency-based statistics. Finally, several regions included genes whose empirical P value showed significant evidence for selection. These genes included XKR6 (P = 0.004 in ASN) and UHRF1BP1 (P = 0.006 in CEU). In some regions, other genes were significant, such as in the TET3-DGUOK region (DUSP11 and STAMBP with P = 0.005 and P = 0.007, resp., in CEU). The PTPN22, ITGAX (near ITGAM), ITPR3, and BLK regions were recently reported to be under selection (in YRI, YRI, YRI, and ASN, resp.) in a candidate gene study by Grossman et al. [10], who used full-genome sequence variation from the 1000 Genomes Project and the composite of multiple signals (CMS) test.
Since the regions in Table 2 showed evidence of selection in the HapMap samples, the evidence centered at the specific SNPs associated with SLE was also tested (Supplementary Table 1). Among these SNPs were rs6677604 (iHS = −2.30 in YRI), rs11755393 in UHRF1BP1 (iHS = −2.28 in CEU), and rs727088 in CD226 (iHS = 2.14 in CEU). The evidence for selection at the UHRF1BP1 variant was recently reported in a study of candidate inflammatory-disease SNPs using the same statistic and HapMap II data [76].
Table 2 note: ... within each population comparison, or the empirical P value estimated for the SLE-associated gene and for each population showed significant evidence for selection (P value < 0.01). Cells that did not meet these thresholds or whose −log(Q) < 2 are marked with (-). The table shows the highest −log(Q) value and respective population for the iHS, D, and FST, the FST statistic (value) for SNPs in the top 1% and the population comparison, and the minimum empirical P value in each region. Q is the rank of the observed statistic for a given SNP divided by the total number of SNPs.
In the HGDP data, the highest XP-EHH was detected in the BLK, CLEC16A, and IRF8 regions and the maximum iHS in the CLEC16A and PTTG1 regions. The CLEC16A, BLK, PTPN22, and UHRF1BP1 regions showed strong evidence for selection under the haplotype-based statistics. TNFSF4, IL10, and BLK were the regions showing the highest degree of population differentiation. The TNFSF4 and BLK regions showed the strongest and most consistent evidence of selection according to all three metrics. Using the same HapMap II data, Raj and colleagues [76] previously reported SNPs with a significant signal of selection in CLEC16A (rs12708716, iHS = 2.29 in CEU) and UHRF1BP1 (rs11755393, iHS = −2.28 in CEU). As mentioned, the BLK and ITGAX-ITGAM regions were recently reported to be under selection (in ASN and YRI, resp.) in a candidate gene study using the 1000 Genomes Project samples [10]. For the genes in Table 2, an inspection of the worldwide distribution of allele frequencies for the SNPs associated with SLE (Supplementary Table 2) revealed interesting patterns for SNPs in BLK, ITGAM, and CLEC16A (Figure 1).
Comparing the results of the tests for selection in the HapMap and the HGDP samples shows that there are seven genetic regions captured by at least one test in both datasets (Table 4). The common regions captured by the majority of tests were those of the PTPN22, UHRF1BP1, and BLK genes. While the region of the TNIP1 gene was captured in both the HapMap and HGDP populations by the frequency spectrum and population differentiation statistics (D and FST), the region of the UHRF1BP1 gene was captured by the haplotype-based statistics. The evidence for selection in these seven genetic regions (Table 4) is strengthened by the fact that they show consistent evidence across different studies and analytic methods.

Discussion
The diversity exhibited in the human genome is a result of stochastic population genetic processes such as mutation, migration, drift, and selection. SLE disproportionately affects women of childbearing age and, without treatment, would tend to put affected individuals at a reproductive disadvantage; here, reproductive disadvantage includes not only conception but also the ability to raise offspring that successfully reproduce. Thus, strong alternative forces or changing selective pressures must exist that permit the relatively high frequency of these risk alleles seen in current populations around the world. Infectious diseases and pathogenic exposures have been postulated to be important factors resulting in strong selective pressure and might provide such alternative pressures. This study investigated whether SLE susceptibility loci show signs of recent positive selection by comparing these regions to the background distribution of genetic variation.
Two important studies have computed several genomewide tests for selection in two main reference populations, the HapMap and the HGDP populations [7,70], and implemented the results in genetic browsers. These browsers were queried to assess whether SLE-associated genetic regions have shown evidence for selection in the HapMap and HGDP populations.
This study reports several SLE-associated loci that show evidence for selection in the HapMap populations and several that show evidence for selection in the HGDP populations. Seven genetic regions showed evidence for selection in both the HapMap and HGDP populations. These include the regions of the PTPN22, TNFSF4, TET3-DGUOK, TNIP1, UHRF1BP1, BLK, and ITGAM genes. Beyond these concordant regions, differences among the results obtained with the different metrics and datasets are expected, mostly due to the different coverage of the SNP arrays used, local adaptation in different ethnic groups, and the different test statistics, which are likely recovering selective events from different time periods and from different stages of the selective sweep [77].
Several of these genes have been previously reported to show patterns of genetic variation that are consistent with evidence for recent positive selection. For example, in their search for inflammatory-disease SNPs that localize to regions of the genome where patterns of genetic variation are consistent with those expected under a model of recent positive selection, Raj et al. [76] reported SNPs in CLEC16A and UHRF1BP1 that exhibit a significant signal of selection using the iHS test. Furthermore, they show that the SLE susceptibility allele in UHRF1BP1 is associated with decreased UHRF1BP1 RNA expression in different cell subsets, suggesting that the SLE risk allele is under recent selection and has a regulatory effect [76]. In addition, UHRF1BP1 has been shown to be significantly differentially expressed in dendritic cells after Mycobacterium tuberculosis (MTB) infection [78]. Using full-genome sequence variation from the 1000 Genomes Project and the composite of multiple signals (CMS) test, Grossman et al. [10] reported the PTPN22, ITGAX (near ITGAM), ITPR3, and BLK regions to show evidence for recent positive selection. Several of the immune genes that have been identified in regions under selection are under the selective pressure of known pathogens, such as the Duffy blood group atypical chemokine receptor (DARC) gene to Plasmodium vivax malaria [79], the ras homolog family member A (RHOA) and OTU domain ubiquitin aldehyde binding 1 (OTUB1) genes to Yersinia pestis (plague) [80], or the tyrosylprotein sulfotransferase 1 (TPST1) gene to HIV [81]. Several genetic regions associated with susceptibility to different autoimmune diseases show evidence of selection that has been attributed to host-pathogen coevolution, including the multiple major histocompatibility complex (MHC) [82][83][84] and the celiac risk locus SH2B3 as a protective factor against bacterial infection [85]. Karlsson et al. [86] have recently reported that cholera has exerted strong selective pressure on proinflammatory pathways, and Jostins et al. [87] reported considerable overlap between susceptibility loci for inflammatory bowel disease and mycobacterial infection. Variants in the IFIH1 gene, whose protein is a cytoplasmic helicase that recognizes RNA of picornaviruses and mediates induction of the interferon response to viral RNA, have been shown to affect IFIH1 function and host antiviral response [88]. In the context of SLE-predisposing loci, Clatworthy et al. [89] have shown that FCGR2B is important in controlling the immune response to Plasmodium falciparum, the parasite responsible for the most severe form of malaria, and suggest that the higher frequency of human FCGR2B polymorphisms predisposing to SLE in Asians and Africans may be maintained because these variants reduce susceptibility to malaria. The complement component (3b/4b) receptor 1 (CR1) gene, whose product is used by the parasite for host invasion, has been shown to be a P. falciparum resistance gene [90]. Machado et al. [91] have suggested that helminth infection has driven positive selection of FCGR variation. Finally, Grossman et al. [10] implicated Salmonella typhimurium and other exposures that directionally drive selection of the toll-like receptor 5 (TLR5) gene [92]. Given that infectious organisms are strong agents of natural selection, it is plausible that alleles selected for protection against infection predispose to autoimmune diseases.
It is important to acknowledge the challenges and limitations inherent to the study of traits with complex genetic architectures and/or a less clear influence on survival and reproduction, such as SLE. As Castiblanco and colleagues [93] recently articulated, the differences in allele and genotype frequencies of diverse human populations depend upon their evolutionary and epidemiological history, including environmental exposures, which might explain why some risk alleles to autoimmunity may be protective factors to infectious diseases and vice versa in a given population (e.g., PTPN22 [94,95] and TNF [96]). Immune and infectious agents have been recognized as among the strongest selective pressures for natural populations, as shown by the identification of candidate adaptive alleles that functionally contribute to biological variation in contemporary populations. However, clarifying the relationship between the functional alleles and reproductive fitness in the environment in which they rose to a high frequency in the ancestors of the study population can rarely be attained. In complex diseases such as SLE, despite the established associations to specific regions or polymorphisms, the true causal variants still remain largely unknown. The emerging availability of genome-wide functional data allows the integration of an unprecedented amount of biological information to help identify potential functional variants and characterize their biological impact. Recent examples demonstrate how the integration of signatures of positive selection with phenotypic association studies and/or with regulatory data can improve the identification of functional loci [10,[97][98][99]. Also, the complex genetic architecture of SLE, resulting from the effects of many alleles of small effects, suggests that adaptation is likely to have occurred by simultaneous selection on variants at many loci. In this scenario, the response to selection is due to small frequency shifts of many alleles. 
However, most methods to detect selection rely on rapid fixation of strongly selected alleles. The development of novel analytical approaches to detect more subtle signatures of selection will improve the identification of selection signatures in complex diseases like SLE. Clearly, much remains to be done until the functional adaptive SLE risk loci are identified, the phenotypic consequences of these risk alleles elucidated, and the relationship between the functional alleles and reproductive fitness clarified. Recent progress will provide the necessary tools to accelerate the discovery of these functional adaptive variants that increase the risk of SLE, which will improve knowledge about the etiology and deepen our understanding of the natural history of SLE. Further research exploring the interplay between infection, type of exposure, additional environmental factors, and autoimmunity will result in the discovery of multiple factors underpinning perhaps newly identified pathophysiological mechanisms of SLE and autoimmune diseases [93].
In summary, this study has systematically queried the HapMap and HGDP populations for evidence of selection at SLE susceptibility regions and provides a comprehensive catalog of regions with both evidence for recent positive selection and association with SLE. These results provide support for recent positive selection influencing genetic variation associated with SLE, suggesting that population-specific selective pressures may be one of the factors behind the high frequency of SLE risk alleles in the population and differential disease risk. Finally, these results support future analyses aimed at identifying the specific selective pressures and characterizing the functional mechanisms of adaptation and disease predisposition.