Study Designs in Genetic Epidemiology

Genetics in the Third Millennium, Year 12, No. 3

Genetic epidemiology, a relatively new science, investigates the role of genetic and environmental factors in the occurrence and progression of diseases. Depending on their objectives, genetic epidemiology studies range from the most preliminary surveys, such as looking for a family history of disease, to the most advanced, including clinical trials of specific strategies for preventing genetic diseases. Different goals require different methods and study designs. In the present study, the main designs, namely familial aggregation, heritability, segregation, linkage, and association, are reviewed, and the purpose of each design and the tests it uses are briefly discussed. Knowledge of the different study designs and their related tests is essential for implementing studies accurately and is also a guide to the shortest route to the study goals.


Introduction
Epidemiology is the study of the distribution and determinants of disease frequency in human populations and the use of this information to control disease and promote health [1]. The goal of epidemiologic research is to collect valid and precise information on the causes, prevention, and treatment of disease [1]. Genetic epidemiology is the study of the role of genes and their interaction with environmental factors in the occurrence of disease in human populations [2]. Genetic epidemiology is still a young discipline, although its parent fields, epidemiology and genetics, have rather long histories [3]. The objectives of epidemiological studies in genetic science are to determine the risks related to allelic variants of candidate genes, to map more accurately the regions of the genome for which there is evidence of linkage to disease susceptibility, and to contribute cases to a genome-wide search for susceptibility genes [4].

Study Designs in Classic Epidemiology
The selection of one design over another depends on the particular research question [3] and also on cost, time, and ethical considerations. The most common types of studies are listed with brief explanations in Table 1 [1, 5-8].

Study Designs in Genetic Epidemiology
Similar to classical epidemiology, observational studies in genetic epidemiology are divided into descriptive and analytical studies. In descriptive studies, the pattern of variation in disease or behavior among immigrants, familial groups, racial/ethnic groups, and social classes, as well as temporal, age, and gender variations, is surveyed and can provide clues to whether genetic or environmental factors are involved [8]. In analytical studies, the effect of genetic exposure on disease is analyzed, and if a causal relationship is established, the genetic region responsible for the disease is identified. Reliance on familial relationships is therefore one of the main differences between classic and genetic epidemiology. Table 2 summarizes the main designs in genetic epidemiology and the process of surveys in genetic diseases [8-16].

Analytical Study in Genetic Epidemiology
4.1. Familial Aggregation. The first step in the study of a potentially genetic characteristic is to determine whether it tends to aggregate in families, without having any specific genetic model in mind [11]. In other words, familial aggregation is the tendency for disease to cluster in families [17]. A simple approach to assessing aggregation in families is to identify a group of affected case subjects and a group of healthy control subjects and to compare the odds of a positive family history among the case subjects with the odds of a positive family history among the control subjects. An odds ratio can then be estimated according to Table 3 [1].
While this approach is valuable, it offers no way to control for the individual or environmental risk factors of each relative, which might be driving the aggregation (e.g., co-occurrence of smoking behaviour and lung cancer aggregation in families); it is also imprecise, since the probability of a positive family history of disease increases with age and with the number of relatives considered [17]. One way to control for confounding variables is to use modeling methods such as logistic regression [17].
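The case-control comparison of family history described above can be sketched in a few lines of Python. This is a minimal illustration, not the layout of Table 3: the function name, the argument order, and the counts are hypothetical, and the 95% confidence interval uses the standard log-based (Woolf) approximation, which the text does not specify.

```python
import math


def family_history_odds_ratio(case_pos, case_neg, ctrl_pos, ctrl_neg, z=1.96):
    """Odds ratio of a positive family history, cases versus controls.

    case_pos / case_neg: cases with / without a positive family history.
    ctrl_pos / ctrl_neg: controls with / without a positive family history.
    Returns (odds_ratio, lower, upper); the interval is the Woolf
    approximation on the log scale (an assumption of this sketch).
    """
    odds_cases = case_pos / case_neg
    odds_controls = ctrl_pos / ctrl_neg
    odds_ratio = odds_cases / odds_controls
    # Standard error of ln(OR) from the four cell counts.
    se_log_or = math.sqrt(1 / case_pos + 1 / case_neg + 1 / ctrl_pos + 1 / ctrl_neg)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical counts, chosen only for illustration.
    or_, lo, hi = family_history_odds_ratio(40, 60, 20, 80)
    print(f"OR = {or_:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

An odds ratio above 1 with an interval excluding 1 would suggest familial aggregation, but, as noted above, this crude estimate does not adjust for the age or number of relatives.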

Heritability.
Heritability is the proportion of variation in a trait that is attributable to genetic differences. One of the simplest designs for studying heritability is the "twin study." This design uses a variance component framework to estimate heritability: if the ratio of genetic variance to phenotypic variance (broad-sense heritability, $H^2 = V_G / V_P$) is close to 1, this is evidence for a strong genetic component [18].
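As a rough illustration of the variance ratio, the sketch below computes broad-sense heritability from assumed variance components. It assumes that phenotypic variance is simply the sum of a genetic and an environmental component (no interaction or covariance terms). The second function is Falconer's classical twin estimator, which the text does not name; it is included only as one common way in which twin correlations are turned into a heritability estimate. All names and numbers are hypothetical.

```python
def broad_sense_heritability(var_genetic, var_environment):
    """Ratio of genetic to phenotypic variance, H^2 = V_G / V_P.

    Assumes V_P = V_G + V_E, that is, no gene-environment covariance
    or interaction (a simplifying assumption of this sketch).
    """
    var_phenotypic = var_genetic + var_environment
    if var_phenotypic <= 0:
        raise ValueError("phenotypic variance must be positive")
    return var_genetic / var_phenotypic


def falconer_heritability(r_mz, r_dz):
    """Falconer's twin estimator, 2 * (r_MZ - r_DZ).

    r_mz and r_dz are the trait correlations in monozygotic and
    dizygotic twin pairs; this estimator is not taken from the text.
    """
    return 2.0 * (r_mz - r_dz)


if __name__ == "__main__":
    # Hypothetical variance components and twin correlations.
    print(broad_sense_heritability(8.0, 2.0))
    print(falconer_heritability(0.8, 0.5))
```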

Segregation Analysis.
Segregation analysis aims to determine whether the pattern of disease occurrence in families fits a particular mode of inheritance. For example, Huntington's disease is controlled by an autosomal dominant allele. Thus, if one parent has the "Aa" genotype and the other the "AA" genotype, the possible offspring genotypes are "AA" and "Aa," and all the children are expected to be affected. If instead one parent is "Aa" and the other is unaffected ("aa"), the offspring genotypes "Aa" and "aa" are equally likely, and half of the children are expected to be affected. Therefore, if the risk of a heritable disease among the offspring of such matings in a large sample is 50% in both girls and boys, the pattern is consistent with autosomal dominant inheritance. The binomial distribution can be used for the analysis. If $n$ is the sample size, $r$ is the number of affected offspring, and $p$ is the probability that an offspring is affected (supposed to be 50%), then under $H_0\colon p = 1/2$ we have

$$P(R = r) = \binom{n}{r} p^{r} (1 - p)^{n - r} = \binom{n}{r} \left(\frac{1}{2}\right)^{n}.$$

4.4. Linkage Analysis. The goal is to find the approximate location of the responsible gene or genes [10, 19, 20]. In linkage analysis, two-point LOD scores evaluate the evidence for linkage between the disease locus and a single marker; when more than one marker is considered, multipoint LOD scores are reported [11]. Two broad types of linkage analysis exist: parametric and nonparametric. If enough information is available to specify the parameters (mode of inheritance and DNA from multiple members of informative families), model-based (parametric) linkage analysis can be used [17]; when the genetic model is unknown, nonparametric analysis should be used [17]. Parametric linkage analysis is a powerful strategy for mapping genes with a simple Mendelian form of inheritance [11]. The result of a linkage analysis is usually expressed as an LOD score (the logarithm of the odds, to base 10). The LOD score is a function of the recombination fraction ($\theta$) [21]. The recombination fraction estimates the probability of recombination between two loci. Although it is a probability, its maximum value is 0.5, which indicates a 50 : 50 chance of recombination, that is, the two loci sort independently and are unlinked; a recombination fraction less than 0.5 indicates that the two loci are not sorting independently and that there is linkage between them [22]. The LOD score is computed by evaluating the likelihood over a range of values of $\theta$ and comparing each with the likelihood when $\theta$ equals 0.5.
Therefore, denoting the likelihood by $L$ and the recombination fraction by $\theta$, with $\theta = 1/2$ its maximum value (no linkage), the LOD score is assessed in the following manner [17, 23-25]:

$$\mathrm{LOD}(\theta) = \log_{10} \frac{L(\theta)}{L(\theta = 1/2)}.$$

An LOD score of 3 or greater (which represents odds of 1000 : 1 in favour of linkage) is taken to indicate statistically significant linkage. If the score is −2 or less, linkage is considered unlikely [22].
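The binomial test used in segregation analysis and the two-point LOD score can both be sketched with the Python standard library. The sketch rests on assumptions that go beyond the text: the two-sided binomial p-value sums the probabilities of all outcomes no more likely than the observed one, and the LOD likelihood is the simple phase-known form $\theta^{k}(1-\theta)^{n-k}$ for $k$ recombinants among $n$ fully informative meioses. Function names and counts are hypothetical.

```python
import math


def binomial_two_sided_p(n, r, p=0.5):
    """Exact two-sided binomial p-value for r affected among n offspring."""
    pmf = [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[r]
    # Sum outcomes at most as likely as the observed count; the small
    # relative tolerance guards against floating-point ties.
    total = sum(x for x in pmf if x <= observed * (1 + 1e-9))
    return min(1.0, total)


def lod_score(theta, n_meioses, n_recombinants):
    """Two-point LOD score at recombination fraction theta.

    Assumes phase-known, fully informative meioses, so that
    L(theta) = theta**k * (1 - theta)**(n - k).
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    k, n = n_recombinants, n_meioses
    likelihood = theta**k * (1 - theta) ** (n - k)
    null_likelihood = 0.5**n  # likelihood when theta = 1/2 (no linkage)
    if likelihood == 0.0:
        return float("-inf")
    return math.log10(likelihood / null_likelihood)


if __name__ == "__main__":
    # Hypothetical data: 7 affected among 12 offspring.
    print(binomial_two_sided_p(12, 7))
    # Hypothetical data: 1 recombinant among 10 meioses, scanned on a grid.
    grid = [i / 100 for i in range(0, 51)]
    best = max(grid, key=lambda t: lod_score(t, 10, 1))
    print(best, lod_score(best, 10, 1))
```

In this toy example the maximum LOD falls at the observed recombinant proportion (1/10) and stays below the threshold of 3, so ten meioses alone would not establish linkage.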

Association Study.
Association studies are similar to case-control studies except that the disease-associated "exposures" that one seeks to identify are variant alleles of genes [11]. These studies are used to find common genetic variants that are prevalent in the general population [17]. In practice, the frequencies of variant alleles among affected individuals are compared with those among unaffected individuals [26].
There are two types of association studies. The first is the candidate gene study [17, 27], which focuses on a specific gene; it may also proceed through the identification of a gene product, such as a specific protein, rather than the gene itself [27]. The second is the genome-wide association study (GWAS), which searches widely across the genome for genes related to the disease [27]. The genes discovered thus far using GWAS have been common loci of small effect, but many genetic epidemiologists believe that, taken together, these genes of small effect may cumulatively have important effects on the risk for complex diseases [28]. Association studies are subdivided into two types of analysis, direct and indirect [17]. In direct association studies, the candidate gene has been designated and association is tested directly; in indirect studies, the candidate gene has not been identified and is assumed to be linked to marker genes [16]. The direct method uses an actual causative mutation at a particular gene in the test for association with the disease phenotype. The indirect method relies on linkage disequilibrium. Linkage disequilibrium (LD) describes the strength of the relationship between alleles at different loci [23, 28]. More specifically, if knowledge of the allele at one locus can predict the allele that will reside at a second locus, then linkage disequilibrium exists between the alleles. If knowledge of the allele at the first locus does not help predict the allele at the second locus, then linkage equilibrium (LE) exists [11, 16, 29]. There are many statistical measures of LD; the more common metrics are Lewontin's $D'$ index, the odds ratio (logistic regression), the correlation $r^2$, and the $\chi^2$ test of independence [30-33].
If alleles C and G are associated (in LD), in other words, if they are dependent on each other, then

$$D = P_{CG} - p_C\, p_G \neq 0,$$

where $P_{CG}$ is the frequency of the haplotype carrying both alleles, $p_C$ and $p_G$ are the allele frequencies, and $D$ is the raw disequilibrium coefficient. Therefore, if two alleles are in linkage equilibrium, then $D = 0$. The raw disequilibrium coefficient, $D$, can be difficult to interpret because it depends on the allele frequencies at the two loci. $D'$ is a scaled version of $D$ that measures LD as a proportion of the maximum amount of LD possible for the specific allele frequencies at the two loci. It can take values from −1 to +1; $|D'| = 1$ means that there is complete LD:

$$D' = \frac{D}{D_{\max}}.$$

$D_{\max}$ can be estimated by the following equation ($p$, allele frequency at one locus, and $q$, allele frequency at the other locus):

$$D_{\max} = \begin{cases} \min\bigl(p(1 - q),\ (1 - p)q\bigr), & D > 0, \\ \min\bigl(pq,\ (1 - p)(1 - q)\bigr), & D < 0. \end{cases}$$

4.5.2. Odds Ratio [11, 29]. Another indicator of association is obtained by comparing the odds of exposure (genotype or allele) in the case group with the odds of exposure in the control group, under the null hypothesis $H_0\colon \mathrm{OR} = 1$ (Table 4).

The squared correlation $r^2$ between alleles [31] is a further measure. The correlation $r$ itself ranges from −1 to +1, so $r^2$ ranges from 0 to 1; under complete linkage disequilibrium $r^2$ equals 1, and it can be calculated as follows:

$$r^2 = \frac{D^2}{p(1 - p)\, q(1 - q)}.$$
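The LD measures discussed here can be computed directly from one haplotype frequency and two allele frequencies. The following is a minimal sketch under standard definitions: $D$ is the haplotype frequency minus the product of the allele frequencies, $D'$ divides $D$ by its maximum attainable magnitude, and $r^2$ divides $D^2$ by the product of the four allele frequencies. The function name and the frequencies are hypothetical, and both loci are assumed to be polymorphic.

```python
def ld_measures(p_cg, p_c, p_g):
    """Return (D, D', r^2) for alleles C and G at two biallelic loci.

    p_cg: frequency of the haplotype carrying both C and G.
    p_c, p_g: frequencies of allele C (first locus) and G (second locus).
    """
    if not (0.0 < p_c < 1.0 and 0.0 < p_g < 1.0):
        raise ValueError("both loci must be polymorphic (0 < frequency < 1)")
    d = p_cg - p_c * p_g
    if d >= 0:
        d_max = min(p_c * (1 - p_g), (1 - p_c) * p_g)
    else:
        d_max = min(p_c * p_g, (1 - p_c) * (1 - p_g))
    d_prime = d / d_max  # sign follows D; magnitude is at most 1
    r_squared = d * d / (p_c * (1 - p_c) * p_g * (1 - p_g))
    return d, d_prime, r_squared


if __name__ == "__main__":
    # Hypothetical frequencies, chosen only for illustration.
    d, d_prime, r2 = ld_measures(p_cg=0.5, p_c=0.6, p_g=0.7)
    print(f"D = {d:.3f}, D' = {d_prime:.3f}, r^2 = {r2:.3f}")
```

With these hypothetical frequencies, $D'$ is noticeably larger than $r^2$, which illustrates that the two scaled measures can differ substantially for the same pair of loci.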

Odds Ratio in Matched Studies.
If an allelic variant of a candidate gene or of a genetic marker were associated with increased risk of disease, one would expect that variant to be transmitted from a heterozygous parent to an affected offspring more often than the 50% frequency expected by chance. Assuming a biallelic locus, let $b$ be the number of times the A1 allele is transmitted from a heterozygous A1A2 parent to an affected offspring, and let $c$ be the number of times the A2 allele is transmitted from a heterozygous A1A2 parent. The comparison of $b$ with $c$ will be recognized as McNemar's test for a pair-matched case-control study, in which the "case" is the transmitted allele and the "control" is the nontransmitted allele [35].
McNemar's Test [32]. McNemar's test is a nonparametric method used on nominal data. It is applied to 2 × 2 contingency tables with matched pairs of subjects [32]. In the 2 × 2 McNemar table, pairs with offspring allele E− and parent allele E+ fall in cell $b$, while pairs with offspring allele E+ and parent allele E− fall in cell $c$ (Table 5).
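The test on the two discordant counts can be sketched briefly. This minimal example assumes the usual McNemar statistic, $(b - c)^2 / (b + c)$ with one degree of freedom and no continuity correction, and obtains the p-value from the identity that a one-degree-of-freedom chi-square tail equals `erfc(sqrt(x / 2))`. The matched odds ratio $b/c$ is the standard paired estimate rather than a formula quoted from the text, and the counts are hypothetical.

```python
import math


def mcnemar_transmission_test(b, c):
    """McNemar-type test on discordant transmission counts.

    b: times allele A1 is transmitted from a heterozygous parent.
    c: times allele A2 is transmitted from a heterozygous parent.
    Returns (chi_square, p_value, matched_odds_ratio).
    """
    if b + c == 0:
        raise ValueError("at least one informative transmission is required")
    chi_square = (b - c) ** 2 / (b + c)
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2))
    matched_or = b / c if c > 0 else float("inf")
    return chi_square, p_value, matched_or


if __name__ == "__main__":
    # Hypothetical transmission counts.
    chi2, p, or_ = mcnemar_transmission_test(30, 15)
    print(f"chi-square = {chi2:.2f}, p = {p:.4f}, matched OR = {or_:.2f}")
```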
Under $H_0$, it is expected that $b/(b + c) = 1/2$, so each discordant cell has an expected count of $(b + c)/2$; then, we have

$$\chi^2 = \frac{\left(b - \frac{b + c}{2}\right)^2}{\frac{b + c}{2}} + \frac{\left(c - \frac{b + c}{2}\right)^2}{\frac{b + c}{2}},$$

and briefly,

$$\chi^2 = \frac{(b - c)^2}{b + c},$$

with one degree of freedom.

One potential limitation of association studies is the probability of a false positive relationship between markers and genes [33]. It should therefore be noted that a significant association test between a marker and alleles confirms linkage disequilibrium and does not confirm causation [34, 36]: the two loci may tend to be inherited together without either being causally related to the disease. This condition is more common in small populations (ethnic or tribal groups) that have many shared traits [37]. LD can be influenced by several factors, including chance, selection, migration, and population isolation [38, 39]. Another restriction of association studies is that the power of the test differs when the disease allele is recessive rather than dominant [11]. Association studies are classified by their control group into two types: related case-control and unrelated case-control studies.
In related case-control studies, relatives of case patients are used as control subjects. These designs can have various control groups, such as sibling controls, cousin controls, and pseudosiblings. Although the use of sibling controls offers several advantages over population controls (unrelated case-control), such as greater willingness to participate and reduced cost and time, it has some disadvantages, such as the probability of overmatching and the difficulty of matching on age and sex in small families [39]. The advantage of a cousin-control group is that one may be able to obtain closer matching on age and year of birth, with less loss of efficiency because the case and cousin are not as closely matched on genotype, and with less chance of overmatching because the case and cousin share only one side of their families [39]. In pseudosibling designs, no actual controls are selected; instead, genotypic data are obtained on the parents of the case, and the genotype transmitted to the case is compared with the three genotypes (pseudosiblings) that were not transmitted to the case. The question this design seeks to address is whether a specific allele or genotype occurs more commonly in cases than in their pseudosiblings [39].

Estimation of Interaction in Exposures.
One of the most useful designs in genetic epidemiology is the case-only design, which can be applied to patients drawn from cross-sectional, cohort, or case-control studies [35]. The case-only design for assessing gene-environment interaction was presented by Aalen et al. and then by Hamajima et al. [35]. The design has been criticized for its susceptibility to bias caused by nonindependence between genetic and environmental factors [39-41]. Each person with disease (D), coded as positive or negative for a genetic factor (G) and an environmental factor (E), falls into one of four categories: (D++), (D+−), (D−+), and (D−−), where the first sign refers to G and the second to E. It is noteworthy that when G and E are independent in the source population, the case-only OR is equivalent to the interaction estimate based on RRs, regardless of disease risk. Conceptually, the interaction between G and E refers to the extent to which the joint effect of the two factors on disease differs from their independent effects (the effect of each factor on disease in the absence of the other) [42-44].
Multiplicative interaction is assessed by comparing the joint effect (the effect on D of the presence of both factors) with the product of the independent effects (the effect of each factor on D in the absence of the other). For example, if the independent effect of G equals 3 and the independent effect of E equals 2, then we would expect the joint effect of G and E to be 6 if there is no multiplicative interaction [44].
Independence between genetic and environmental factors is critical to the validity of case-only estimates of interaction [40, 41]: when G and E are independent, the OR relating G and E among cases is equivalent to the G-E interaction effect; see Table 6.
In practice, if the sources of nonindependence are measured, stratification or adjustment using multivariable models can be used to remove the bias in case-only analysis of interaction [41].
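The case-only calculation and the multiplicative-interaction check can be illustrated as follows. This is a minimal sketch that assumes cases are cross-classified by G and E in a 2 × 2 table (the exact layout of Table 6 is not reproduced here) and that G and E are independent in the source population, as the text requires. Function names and counts are hypothetical.

```python
def case_only_odds_ratio(g_pos_e_pos, g_pos_e_neg, g_neg_e_pos, g_neg_e_neg):
    """Odds ratio relating G and E among cases only.

    Each argument is the number of cases in one G-by-E category.
    Under G-E independence in the source population, this estimates
    the multiplicative interaction between G and E.
    """
    return (g_pos_e_pos * g_neg_e_neg) / (g_pos_e_neg * g_neg_e_pos)


def multiplicative_interaction(joint_effect, effect_g, effect_e):
    """Ratio of the observed joint effect to the product of the
    independent effects; 1.0 means no multiplicative interaction."""
    return joint_effect / (effect_g * effect_e)


if __name__ == "__main__":
    # Hypothetical case counts: G+E+, G+E-, G-E+, G-E-.
    print(case_only_odds_ratio(40, 20, 60, 80))
    # The worked example from the text: effects of 3 and 2, joint effect 6.
    print(multiplicative_interaction(6, 3, 2))
```

A case-only OR close to 1 would be consistent with no multiplicative interaction, provided the independence assumption holds.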

New Designs in Genetic Studies.
Although case-control studies are commonly used in genetic epidemiology, an increasing number of cohort studies have been established over the past decade [45, 46]. Two more recent designs, the nested case-cohort and the nested case-control study, have been suggested. The major advantage of nested designs is their ability to match controls with cases on follow-up duration [4].

Study Designs in Mycobacterium Tuberculosis
Twin studies are among the earliest and least expensive heritability studies of tuberculosis (TB) and have provided valuable information about its etiology. Because twins theoretically share the same environment, higher rates of concordance for monozygous (identical) twins than for dizygous (fraternal) twins suggest that genetic factors are important in susceptibility to tuberculosis and provide an estimate of the magnitude of this effect [47, 48]. During the past 15 years, various surveys have been carried out on the genetics of susceptibility to mycobacterial diseases [49, 50]. Etiological factors in tuberculosis have also been examined in case-control studies, such as the study carried out in Gambia, which showed that polymorphisms in the NRAMP1 gene were significantly associated with susceptibility to tuberculosis [51, 52]. Another case-control study, in London, showed an effect of the VDR gene on susceptibility to TB [52]. Association designs have also identified important candidate genes in the pathogenesis of tuberculosis, such as NRAMP1, the vitamin D3 receptor, interferon-γ, interleukin-1, interleukin-12, tumor necrosis factor-α, interleukin-4, and interleukin-10 [53-55]. Linkage studies can show that a disease susceptibility gene or genes lie in the neighbourhood of a marker, indicating that detailed investigation of genes in the region is warranted. One such genome-wide scan of affected sibling pairs from Gambia and South Africa identified potential susceptibility loci on chromosomes 15q and Xq [49, 56]. Deng [56] reviewed the use of genetic linkage and association studies in the investigation of genetic susceptibility to infectious diseases. Implementation of such studies in developing countries presents some particular challenges. For example, since tuberculosis occurs mainly in adults, parents of a case are frequently unavailable for genotyping; however, unaffected siblings can be used as controls [57]. In the study of complex diseases such as TB, because the effects of genes may be modified by environmental (i.e., nongenetic) factors, gene-environment interactions may be explored in study designs such as case-only, cross-sectional, cohort, and case-control studies and in family-based designs such as case-parent, affected sibling pair, and twin studies [57].

Table 1: Main study designs in classic epidemiology.

Table 2: Main study designs in genetic epidemiology.