Cumulative Small Effect Genetic Markers and the Risk of Colorectal Cancer in Poland, Estonia, Lithuania, and Latvia

The continued identification of new low-penetrance genetic variants for colorectal cancer (CRC) raises the question of their potential cumulative effect among compound carriers. We focused on 6 SNPs (rs3802842, rs4464148, rs4779584, rs4939827, rs6983267, and rs10795668), already described as risk markers, and tested their possible independent and combined contribution to CRC predisposition. Material and Methods. DNA was collected and genotyped from 2330 individuals (1165 unselected consecutive CRC cases and 1165 controls) from Estonia (166 cases and 166 controls), Latvia (81 cases and 81 controls), Lithuania (123 cases and 123 controls), and Poland (795 cases and 795 controls). Results. Beyond individual effects, the analysis revealed statistically significant linear cumulative effects for these 6 markers in all samples except the Latvian one (corrected P value = 0.018 for the Estonian, corrected P value = 0.0034 for the Lithuanian, and corrected P value = 0.0076 for the Polish sample). Conclusions. The significant linear cumulative effects demonstrated here support the idea of using sets of low-risk markers in clinical practice to delimit new groups at high risk of CRC among individuals who do not carry the usual CRC high-risk markers.


Introduction
Colorectal cancer (CRC) is one of the most frequently diagnosed cancers in the Polish population, and it ranks second in cancer mortality among men and third among women [1,2]. Of all newly diagnosed CRC cases, at most 10% are caused by a high-risk genetic predisposition [3]. Thus, a large proportion of genetic predisposition to CRC may be due to low-penetrance variants. However, while high-risk genes are generally well identified, little is still known about low-risk CRC susceptibility genes. Several studies have led to the identification of genetic markers with odds ratio (OR) ∼2 [4][5][6][7], although some results are inconclusive and the clinical relevance of low-risk markers cannot be definitively established [8][9][10][11].
In this study we genotyped 6 SNPs (rs3802842, rs4464148, rs4779584, rs4939827, rs6983267, and rs10795668) among nonselected consecutive CRC cases and controls from Estonia, Latvia, Lithuania, and Poland to identify variants and cumulative sets of variants associated with colon cancer risk and to assess potential differences or similarities between these neighboring populations.
We analyzed the effect of each of those markers separately. In addition, assuming that small effect genetic markers may have a cumulative effect in compound carriers, we tried to establish a potential set of markers that could account, in combination, for a high risk of CRC. A recent article by Dunlop et al. [14] successfully showed how cumulative effects of low-risk markers can be explored for CRC in several populations. However, cumulative effects of the markers that are the object of the present study have not yet been analyzed. Here we follow a similar approach for a smaller number of genetic markers, including the size of the pool of potential risk markers as an additional variable.

Material
Four groups of patients were included in this study. The unselected newborns, used as controls in Groups 2 and 3, cannot be matched for age as in the case of the controls for Groups 1 and 4. That is, although they have no relationship with the CRC cases, it cannot be excluded that some of them will develop CRC in the future, as they grow up. This situation decreases the statistical power of the study, because it is more difficult to identify true differences between cases and controls. Thus, we are increasing the risk of false negatives (Type II error), but on the other hand we are decreasing the risk of false positives (Type I error). In other words, while nonsignificant differences calculated for Groups 2 and 3 may be due to lack of statistical power, significant differences can only have P values equal to or lower than calculated.
In all cases, peripheral blood samples were collected from the patients or controls after obtaining informed consent for genetic analysis. DNA was extracted directly from leukocytes following standard methodology [15].
The study was approved by the institutional review board of ethics of the Pomeranian Medical University (Poland).

Statistical Analysis.
Differences in the genotype distribution between countries were analyzed, separately for cases and for controls, by applying Pearson's Chi-squared test for 18 conditions (18 genotypes) and 4 groups (4 countries), that is, (18 − 1) × (4 − 1) = 51 degrees of freedom each.
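The test statistic and the degrees of freedom described above can be illustrated with a minimal sketch. The original calculations were done in R; the Python function below, its name, and the placeholder counts are assumptions made for illustration only and do not reproduce the study data.

```python
def pearson_chi_squared(table):
    """Return (statistic, degrees_of_freedom) for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of genotype and country.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Shape check only: 18 genotype categories (rows) x 4 countries (columns).
# The counts are made-up placeholders, not data from the study.
placeholder = [[10 + (i + j) % 3 for j in range(4)] for i in range(18)]
stat, dof = pearson_chi_squared(placeholder)
```

With 18 rows and 4 columns the function reports 51 degrees of freedom, matching the value stated in the text. Converting the statistic into a P value requires the Chi-squared distribution, which is left to a statistics package.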
The mode of inheritance of the phenotype associated with the risk markers is usually not part of a GWAS analysis, but the presence of a single risk allele does not necessarily increase the disease risk (e.g., if the inheritance model is recessive). For each country, each of the markers was therefore analyzed separately for its most probable inheritance model. For simplicity, only two basic inheritance models were taken into account, recessive and dominant. According to the model chosen, the presence or absence of a risk genotype was assessed for each individual. Other models such as codominant, additive, and overdominant were intentionally left out of scope to avoid unnecessarily compromising the statistical power of the study.
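The assessment of a risk genotype under the two inheritance models can be sketched as follows. This is a minimal illustration, assuming genotypes are written as two-letter strings and that the risk allele of each marker is known; the function name and the example alleles are hypothetical.

```python
def carries_risk_genotype(genotype, risk_allele, model):
    """Return True if the genotype is a risk genotype under the given model.

    genotype: two-letter string such as "AC" (assumed encoding).
    model: "dominant" (one copy suffices) or "recessive" (two copies needed).
    """
    copies = genotype.count(risk_allele)
    if model == "dominant":
        return copies >= 1
    if model == "recessive":
        return copies == 2
    raise ValueError(f"unsupported inheritance model: {model}")


# Hypothetical example: "C" is taken as the risk allele for illustration.
heterozygote_dominant = carries_risk_genotype("AC", "C", "dominant")
heterozygote_recessive = carries_risk_genotype("AC", "C", "recessive")
```

A heterozygote counts as a carrier under the dominant model but not under the recessive one, which is why the same genotype data can yield different risk estimates depending on the model chosen.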
The particular influence of each of the genotypes of the 6 markers on disease risk was calculated by logistic regression, taking the inheritance model into account and independently for each country. Sex and age were not available for the Latvian and Lithuanian samples and could not be adjusted for systematically; therefore, we chose an unconditional regression model to analyze them. In contrast, the Estonian and the Polish samples had paired controls matched for sex and age. In these two cases, a conditional regression model was preferred. Bonferroni correction for multiple testing was applied in all cases since two different inheritance models were put to the test for each SNP and country.
Cumulative effects were explored by a similar approach, but this time focusing on the number of cumulated risk genotypes (again taking the inheritance model into consideration) for each individual. For each country, a list was generated with all risk markers sorted by increasing P value (calculated as described above). Using these lists as a basis, ORs were calculated for compound carriers of different numbers of risk markers (following the order of the list). Here again, Bonferroni correction for multiple testing was applied, since there were five checked pools of risk markers for each country.
There were three choices for establishing the reference to calculate said ORs. One was to take the group of noncarriers as a reference, but frequently the size of that group was small (in one case even nonexistent) and could therefore account for artificially high ORs. Another one was to take the most frequent group among controls, but that was a different group for each case, thus shifting the OR curves up and down and making comparisons between countries and between different sizes of risk marker lists virtually impossible. We decided to compare the observed proportion to the expected proportion (1 : 1) for the same sample size. In this way, all depicted ORs are directly comparable. The drawback is that this happens at the cost of some statistical power for the groups with larger numbers of cumulated risk markers (we make comparisons for smaller total sample sizes compared to what we would do with any other method). This method is very conservative and does not increase the risk of type I error, rather the opposite, but since the present study is an exploratory one, we were concerned more about false positives than about false negatives.
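For the unconditional case with a single binary predictor (risk genotype present or absent), the logistic regression estimate of the OR coincides with the cross-product ratio of the 2 x 2 table, and its Wald standard error on the log scale is the square root of the summed reciprocal cell counts. The sketch below illustrates this special case together with the Bonferroni correction for two tested inheritance models. It is not the R code used in the study, it does not cover the conditional (matched) model, and all names and numbers are hypothetical.

```python
import math


def odds_ratio_with_ci(case_carriers, case_noncarriers,
                       control_carriers, control_noncarriers, z=1.96):
    """Odds ratio and Wald 95% CI for a binary risk-genotype indicator.

    All four counts must be greater than zero.
    """
    a, b = case_carriers, case_noncarriers
    c, d = control_carriers, control_noncarriers
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P value, capped at 1."""
    return min(1.0, p_value * n_tests)


# Hypothetical counts, not taken from the study.
example_or, example_lower, example_upper = odds_ratio_with_ci(20, 10, 10, 20)
# Two inheritance models were tested per SNP and country, hence n_tests=2.
example_corrected = bonferroni(0.03, n_tests=2)
```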
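The counting step can be sketched as follows: for a given individual, the number of cumulated risk genotypes is taken over the first markers of the sorted list, for each of the five pool sizes (2 to 6). The marker order and the carrier status used here are hypothetical and serve only to illustrate the bookkeeping.

```python
def count_cumulated_markers(risk_flags, ordered_markers, pool_size):
    """Number of risk genotypes carried among the first pool_size markers.

    risk_flags: dict mapping marker name to True if the individual carries
    the risk genotype under the inheritance model chosen for that marker.
    """
    pool = ordered_markers[:pool_size]
    return sum(1 for marker in pool if risk_flags.get(marker, False))


# Hypothetical ordering of the six markers; in the study the order is
# country-specific and follows the single-marker association results.
ordered = ["rs3802842", "rs4939827", "rs6983267",
           "rs10795668", "rs4779584", "rs4464148"]
# Hypothetical individual carrying three risk genotypes.
person = {"rs3802842": True, "rs6983267": True, "rs4464148": True}
counts = {size: count_cumulated_markers(person, ordered, size)
          for size in range(2, 7)}
```

The same individual contributes a different count depending on the pool size, which is why a separate curve is obtained for each pool.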
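One possible reading of the chosen reference is sketched below: among the carriers of a given number of cumulated markers, the observed odds of cases to controls are compared with the odds expected under a 1 : 1 split of the same total. This interpretation, the function name, and the handling of empty groups are assumptions; the text does not specify the exact computation, and confidence intervals are deliberately left out.

```python
def odds_ratio_vs_even_split(n_cases, n_controls):
    """Observed case/control odds relative to an expected 1:1 split.

    Returns None when either count is zero, because the ratio is then
    zero or undefined and has no finite logarithm.
    """
    if n_cases == 0 or n_controls == 0:
        return None
    total = n_cases + n_controls
    expected_cases = total / 2      # expected under a 1:1 proportion
    expected_controls = total / 2
    observed_odds = n_cases / n_controls
    expected_odds = expected_cases / expected_controls  # always 1.0
    return observed_odds / expected_odds


# Hypothetical counts of carriers of a given number of markers.
example = odds_ratio_vs_even_split(30, 15)
```

Because every country sample contains equal numbers of cases and controls, a 1 : 1 split is the natural expectation in the absence of any association, and all groups share the same reference.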
All calculations were done in R, version 2.15.2 [16].

Genotypes.
Genotyping success rate was 100% for all 6 SNPs. There was a significant deviation from Hardy-Weinberg equilibrium for rs4464148 among cases in the Lithuanian sample (P = 0.045); however, a deviation in the group of cases is not unexpected, since this group is not a representative population sample. The genotype distribution shows some divergences between samples. As shown in Table 1, there are differences greater than 12 percentage points between countries: for example, in the Latvian sample, 49.4% of controls had genotype AC for rs3802842, while in the Lithuanian sample the figure was 36.7% (actually closer to the percentage observed among cases in the Latvian sample, 33.3%). Other remarkable differences, larger than 12 percentage points, affected cases from Latvia and Lithuania for rs4779584 (genotype CC), rs3802842 (genotypes AA and AC), rs4464148 (genotype TC), and rs4939827 (genotypes CT and TT). The Latvian sample also diverged largely from the Estonian sample among controls for rs6983267 (genotype TT, ∼10 percentage points) and rs4939827 (genotype CT, ∼13 percentage points).
Still, these differences between countries were not statistically significant, neither in the control group (Pearson's Chi-squared test, P = 0.98, df = 51) nor in the group of cases (Pearson's Chi-squared test, P = 0.95, df = 51). Analyzing each SNP separately does not change the situation; even the lowest P value does not indicate any significant difference between countries (Pearson's Chi-squared test, P = 0.12, df = 6, for rs6983267 among controls).
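A deviation from Hardy-Weinberg equilibrium, as mentioned above for rs4464148, is commonly checked with a Chi-squared goodness-of-fit test with one degree of freedom. The sketch below shows that standard test; whether this exact variant was used in the study is not stated, so the function and the example counts should be read as assumptions.

```python
import math


def hardy_weinberg_test(n_hom_major, n_het, n_hom_minor):
    """Chi-squared test (1 df) for deviation from Hardy-Weinberg equilibrium.

    Both alleles must be present in the sample, otherwise an expected
    count is zero and the statistic is undefined.
    """
    n = n_hom_major + n_het + n_hom_minor
    p = (2 * n_hom_major + n_het) / (2 * n)   # major allele frequency
    q = 1 - p
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [n_hom_major, n_het, n_hom_minor]
    statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the Chi-squared distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical genotype counts in perfect equilibrium (not study data).
stat_eq, p_eq = hardy_weinberg_test(25, 50, 25)
```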

Inheritance Models.
A general overview of the estimated disease risk (in OR and 95% CI) depending on the inheritance model and country for each of the analyzed SNPs is presented in Figure 1. There is an overlapping region of the 95% CI for each SNP and inheritance model for all countries, as expected from the fact that there were no significant differences in the genotype distribution. However, country and inheritance model both have a visible effect on the estimated disease risk. For example, for rs4464148 the OR under the dominant model is similar for all countries (and all ORs are within the overlapping region of the 95% CI for all countries), whereas under the recessive model Estonia, Latvia, and Lithuania show an estimated decrease in risk, in contrast to Poland.
For each country and marker, the inheritance model that maximizes disease risk for the given risk allele was chosen. Only the markers rs4779584 and rs4464148 share the same inheritance model for all countries (Table 2). The country-specific analysis of the separate effects of each marker showed a marginal association for the markers rs6983267 and rs10795668 in the Polish sample and for marker rs4939827 in the Lithuanian sample. However, after applying correction for multiple testing, these associations were no longer significant. In contrast, marker rs3802842 did withstand the correction for multiple testing for the Lithuanian sample (corrected P value = 0.022).
Interestingly, the analysis of the linear cumulative model, where the number (quantitative, discrete) of risk genotypes carried by each individual was taken as an independent variable, showed a statistically significant association that withstood the correction for multiple testing for all samples (Estonia: corrected P value = 0.018; Lithuania: corrected P value = 0.0034; Poland: corrected P value = 0.0076) except for the Latvian one (nominal P value = 0.137), which was the smallest sample of all four (Table 2).
Knowing that the linear cumulative model was effectively explaining the observed data gave us the needed support to proceed to the next analysis step, where we tried to determine the cumulative effects of a particular number of cumulated risk markers from a sorted list out of the 6 markers analyzed (see Section 2). Taking these country-specific lists as a basis, with markers sorted by increasing P value, we analyzed the influence of the number of cumulated markers on the risk of CRC. Disease risk was calculated for compound carriers of risk markers, separately for each country. The reference was the expected proportion of compound carriers among cases and controls (1 : 1 in all cases). The curves depicting that relationship were systematically drawn for an increasing pool of markers out of which the number of cumulated markers was drawn: there is a curve showing that relationship for the first two markers of the list, a different curve for the first three markers, and so on for all six markers. Figure 2(a) shows these data separately for each country.
In all cases, the shapes of the curves support the hypothesis of the cumulative model, where disease risk increases with the number of cumulated risk markers. Still, some curves are steeper than others (leading to higher odds ratios) and those are not necessarily the ones corresponding to the largest pool sizes. Only in the case of Poland is the highest odds ratio reached for the largest pool of markers (all six markers analyzed).
The curve showing the highest risk for the lowest number of cumulated markers was represented again in detail (Figure 2(b)), with confidence intervals and a histogram depicting the proportion of cases and controls carrying those particular markers.
In the Estonian sample, the cumulative model for the pool of four markers reached OR 1.81 for an accumulation of all four markers. Analogously, the Latvian sample reached OR 2.16 for an accumulation of three or more risk markers for the pool of five markers. The Lithuanian sample achieved an OR of 4.37 for an accumulation of all four markers out of a pool of four. The Polish sample reached OR 2.16 for an accumulation of four markers out of a pool of four. However, as is noticeable from the broad confidence intervals at each position, none of the differences in disease risk was statistically significant in any of the samples for any of the possible marker pool sizes.

Discussion
In this study 6 SNPs were analyzed that, according to previous literature data, could be low-risk genetic markers for CRC. 1165 consecutive CRC cases and 1165 controls from Estonia, Latvia, Lithuania, and Poland were examined to assess whether these genetic variants are significantly associated with the occurrence of colon cancer in the Eastern Baltic States and Poland and whether any similarities or differences between the populations could be identified.
Comparison of the genotyping data from Poland and the Eastern Baltic states revealed some heterogeneity; however, differences were not statistically significant.

Figure 2: The association between colorectal cancer risk (y-axis) and the number of cumulated risk markers carried by a single subject (x-axis) is depicted at (a), independently for each country. The numbers attached to the end of each curve stand for the pool size of markers out of which the number of cumulated risk markers is calculated (see Section 2 for more details). The curve reaching the highest odds ratio (arrow) is represented in detail with confidence intervals and frequency histograms at (b). Note that the pool size of markers may be larger than the number of cumulated risk markers carried by a single subject. The odds ratio for 0 cumulated markers could not be calculated for the Estonian sample due to the complete absence of noncarriers among cases.

With the exception of rs4779584 and rs4464148 (both are dominant), the best-fitting inheritance model, defined as the model showing the lowest P value, for the rest of the markers was not consistent throughout the different countries (Table 2). Although previous studies on low-risk susceptibility genes had shown population-specific effects, these affected populations from geographically distant regions [17][18][19]. Here, rather similar inheritance models were expected because there were earlier data showing large genetic similarities between these neighboring populations, like that for the case of the mismatch repair genes [20] or the BRCA1 gene [21][22][23][24]. That heterogeneity in the inheritance models made a country-specific analysis more advisable than a pooled analysis, but at the cost of a loss of statistical power due to smaller sample sizes. A logistic regression analysis of each of the six markers, independently for each country, revealed a statistically significant association between a single marker and CRC risk only for rs3802842 in the Lithuanian sample (dominant inheritance model: OR = 1.98, 95% CI = 1.17-3.36, and corrected P value = 0.022). But most importantly, all cumulative models, with the exception of the Latvian sample (the smallest one), showed a significant increase in the risk of developing CRC for an increasing number of cumulated markers (corrected P value = 0.018 for the Estonian, corrected P value = 0.0034 for the Lithuanian, and corrected P value = 0.0076 for the Polish sample).
Having demonstrated the cumulative effect of these six low-risk markers, we focused on particular combinations of markers that could be maximizing disease risk. Assuming that some markers would play a larger role than others, a country-specific list of markers was created, ordered by increasing P value for an association with CRC risk. The cumulative effect was then tested for each pool size. Some of the marker combinations showed high odds ratios (up to 4.37 in the Lithuanian sample), but none of these differences was statistically significant.
To summarize, the present study demonstrated significant cumulative effects for the total of the 6 analyzed markers but failed to show significant effects of particular combinations assuming particular inheritance models.
Still, it is worth mentioning some advantages of the proposed stepwise approach in comparison with previous analyses of cumulative effects where there is no focused analysis of ordered pools of markers [14]. For example, Figure 2 shows that maximizing the marker pool may lead to a relative decrease of disease risk (e.g., the pool of 4 markers has a higher odds ratio for the Estonian sample than the pools of 5 or 6 markers). This is expected, because some markers do not seem to have a large effect on disease risk (Tables 1 and 2) and, in the absence of interaction effects, including them in the model may only decrease the sensitivity and the power of the study.
Further studies should include larger sample sizes and country-specific sets of genetic markers to create more accurate cumulative models before they could be applied in clinical practice.