A Likelihood Ratio Approach for Utilizing Case-Control Data in the Clinical Classification of Rare Sequence Variants: Application to BRCA1 and BRCA2



Introduction
Clinical genetic testing of disease susceptibility genes often identifies variants of uncertain significance (VUS), complicating the clinical management of carriers and their families [1]. Assessing the clinical significance of these rare sequence variants, including missense substitutions, in-frame deletions and insertions, and intronic variants, is essential for directing the clinical management of carriers and their relatives towards appropriate prevention, early detection, and personalized treatment.
The most widely used method for interpreting germline variants is the application of the standards and guidelines recommended by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) [2]. Strength levels (very strong, strong, moderate, and supporting) are assigned to independent lines of evidence for or against variant pathogenicity. These strength levels are then combined in a scoring system to provide a clinical class, expressed as pathogenic, likely pathogenic, likely benign, benign, or VUS. The guidelines integrate various sources of information, including the variant's nature and position (e.g., nonsense, frameshift, and missense) and clinical data (e.g., prevalence in affected individuals and controls), and the combined information is interpreted to establish the significance of the variant under investigation with respect to risk. These criteria were recently reinterpreted in a quantitative Bayesian framework, which derived ranges of likelihood ratios (LRs) consistent with each of the evidence strength levels [3]. For case-control data, the specific criterion (PS4) states that a relative risk (RR) or odds ratio (OR) > 5.0 with nominal statistical significance (i.e., the confidence interval of the RR or OR does not include 1) provides strong evidence in favor of pathogenicity [2].
A significant advance in the classification of variants in cancer and other disease genes was the development of the multifactorial integrated likelihood ratio model [4]. This model combines multiple features in a Bayesian framework, under the assumption that each is an independent predictor of variant pathogenicity, and thus provides a quantitative estimate of the pathogenicity of a variant [5]. The ENIGMA consortium [6] has been applying and extending this multifactorial likelihood model. To date, application of this model has included clinically calibrated prior probabilities of pathogenicity derived from bioinformatic prediction of variant effect and location, along with a combined LR derived from clinical data [5], such as family history of cancer [7], breast cancer tumor pathology [8], variant cosegregation with disease [9,10], and variant co-occurrence in trans with a pathogenic variant (PV) in the same gene [7]. This model can also incorporate LRs derived from variant frequency in cases and controls. Recently, case-control information derived from genotype data for 20 variants was incorporated into a comprehensive multifactorial likelihood analysis of BRCA1 and BRCA2 variants by ENIGMA [11], using a method incorporating gene- and age-specific penetrance of PV carriers only. Such case-control LR calculations take into consideration gene- and age-specific penetrance values, and hence they might be expected to outperform the statistical measures currently recommended by ACMG/AMP for the analysis of case-control data (i.e., OR or RR estimates).
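The Bayesian combination described above can be illustrated with a minimal sketch: a prior probability of pathogenicity is converted to odds, multiplied by each independent LR, and converted back to a probability. The function name and the numerical inputs are hypothetical and are not taken from the ENIGMA implementation.

```python
import math

def posterior_probability(prior, likelihood_ratios):
    """Combine a prior probability of pathogenicity with independent LRs.

    Illustrative helper (hypothetical name): posterior odds equal the
    prior odds multiplied by the product of the likelihood ratios.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    # Work on the log scale so that products of many LRs stay stable.
    log_odds = math.log(prior / (1.0 - prior))
    log_odds += sum(math.log(lr) for lr in likelihood_ratios)
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical inputs: prior of 0.10 and three independent lines of evidence.
print(posterior_probability(0.10, [4.0, 2.5, 0.8]))
```

Because the LRs enter only through their product, a case-control LR can be added to the other lines of evidence simply by appending it to the list, provided it is independent of them.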
In this paper, we present a novel case-control LR method, based on the same principle as that used by Parsons et al. [11], that incorporates age information for both carriers and noncarriers in the dataset. The method can be used to obtain evidence in favor of or against pathogenicity for rare variants in any gene for which there exist known age-specific penetrance estimates, based on data obtained from case-control studies. We illustrate the use of this method by calculating LRs for 24 BRCA1 and 68 BRCA2 variants from breast cancer case-control genotype data generated by the Breast Cancer Association Consortium (BCAC) as part of the large-scale OncoArray project [12]. We further demonstrate the utility of this case-control LR approach in aiding the interpretation of the clinical significance of variants, using evidence aligned to ACMG/AMP code strengths or other classification methods.

Case-Control Datasets
2.1.1. Simulated Case-Control Dataset. Genotype data simulations were performed using the R (v3.6.1) statistical computing language (https://www.r-project.org/). To create case-control datasets, genotypes for cases and controls were simulated using a Poisson distribution with lambda (λ) equal to the mean number of events (variant carriers) in the given interval, expressed as a function of N, RR, and MAF, where N denotes the sample size, RR denotes the relative breast cancer risk of the causal variant, and MAF denotes the minor allele frequency of the variant in the general population. Ages were simulated using a normal distribution, with the mean and standard deviation following the gene-specific age distribution in the CARRIER population-based study [13]. Genotype data simulations were carried out for variants conferring an RR of 1 (indicating no increased risk), 2, 3, 4, 5, 6, 7, 8, 9, or 10; a minor allele frequency in controls of 0.0001, 0.00005, or 0.00003; and a sample size of N = 20,000 (20,000 breast cancer cases and 20,000 controls), 30,000 (30,000 breast cancer cases and 30,000 controls), or 50,000 (50,000 breast cancer cases and 50,000 controls). For each of these 90 scenarios, we simulated 10,000 replicates.
Additionally, to account for the possibility that age information is not available, we repeated the analysis using the same age for all individuals.
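As a rough illustration of this simulation design, the sketch below draws carrier counts from a Poisson distribution and ages from a normal distribution. It is written in Python rather than the R used by the authors, and it assumes that the expected number of carriers is N × MAF × RR in cases and N × MAF in controls; that expression is an assumption, not a formula recovered from the text. The age mean and standard deviation are hypothetical placeholders, not the CARRIER values.

```python
import math
import random

def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_replicate(n, maf, rr, rng):
    """Return (carriers among cases, carriers among controls).

    ASSUMPTION: expected carriers are n * maf * rr in cases and
    n * maf in controls; the source does not spell out the expression.
    """
    case_carriers = poisson_draw(n * maf * rr, rng)
    control_carriers = poisson_draw(n * maf, rng)
    return case_carriers, control_carriers

def simulate_ages(count, mean_age, sd_age, rng):
    """Draw ages from a normal distribution (mean and SD are placeholders)."""
    return [rng.gauss(mean_age, sd_age) for _ in range(count)]

rng = random.Random(2023)
cases, controls = simulate_replicate(n=20000, maf=0.0001, rr=5, rng=rng)
carrier_ages = simulate_ages(cases, mean_age=50.0, sd_age=12.0, rng=rng)
print(cases, controls, len(carrier_ages))
```

Repeating `simulate_replicate` 10,000 times per scenario and analysing each replicate would give the empirical power and type I error described in the evaluation section.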
2.1.2. BCAC OncoArray Dataset. Genotype data were generated as part of the BCAC component of the OncoArray project [12] (studies included in the analysis are listed in Supplementary Table S1) and were available for 75,657 breast cancer cases and 52,987 controls of European ancestry. The majority of studies were population-based case-control studies or case-control studies nested within population-based cohorts. However, a subset of studies oversampled cases with a family history of breast cancer. Of these, 464 breast cancer cases and 1,347 controls had missing information regarding their age at diagnosis or interview, respectively, and were excluded from the analyses. Another 1,445 cases and 858 controls were removed because their ages fell outside the interval of 21-80 years (the age range for which penetrance estimates were available). Cluster plots of 56 BRCA1 and 127 BRCA2 variants, nominated by ENIGMA researchers for inclusion in the OncoArray project, were manually checked to review the automated calls. This was done because automated genotype calling for rare variants from GWAS chips has been shown to be suboptimal [14]. Genotypes were adjusted for 41 BRCA1 and 91 BRCA2 variants, while 3 BRCA1 and 2 BRCA2 variant genotypes were determined to have been called correctly by automated clustering. Genotype recalling was not performed for 12 BRCA1 and 34 BRCA2 variants due to the low quality of the genotype data; these variants were not considered further.
After genotype cluster review and recalling, 16 BRCA1 and 19 BRCA2 variants were excluded from further analysis due to their high frequency (>0.1%). Additionally, case-control LR calculations were not possible for four BRCA1 and six BRCA2 variants due to the absence of variant carriers in the postfiltering dataset. After these exclusions, case-control LR and logistic regression analyses were performed for 24 BRCA1 and 68 BRCA2 variants. It should be noted that some of the variants selected for the array have subsequently been classified, and others had known pathogenicity status and were included as positive or negative controls.

Statistical Analyses
2.2.1. Case-Control Likelihood Ratio Method. This method (detailed in Supplementary File 1) compares the likelihood of the distribution of the variant of interest among cases and controls under the hypothesis that the variant is associated with similar risks of the disease in question as the "average" pathogenic variant ($H_p$) with the likelihood under the hypothesis that it is a benign variant not associated with increased risk ($H_b$). These risks may be age-, sex-, and/or country-specific. Thus
$$\mathrm{LR} = \frac{P(\mathrm{Data} \mid H_p)}{P(\mathrm{Data} \mid H_b)},$$
where Data denotes observed data on carrier status of a variant of interest, case-control status, and age at diagnosis or interview, combined over all individuals in the dataset.
In order to calculate the above LR, we follow a survival analysis framework. We first determine the probability that an individual with genotype k remains unaffected at age t, $S_k(t)$, and the corresponding probability that an individual with genotype k is affected at age t, $f_k(t)$ (where k = 0 or 1 for noncarriers and carriers, respectively). These probabilities can be computed from the age-specific baseline incidence, $\lambda_0(t)$, and the age-specific log-relative risk of an assumed pathogenic variant in the gene of interest, $\beta(t)$. These probabilities are given by
$$S_k(t) = \exp\left(-\int_0^t \lambda_0(u)\, e^{k\beta(u)}\, du\right), \qquad f_k(t) = \lambda_0(t)\, e^{k\beta(t)}\, S_k(t).$$
As detailed in Supplementary File 1, the likelihood ratio can be expressed, to a close approximation, in terms of $S_k(t)$ and $f_k(t)$ together with the following quantities: N, the total number of individuals; K, the number of variant carriers; $v_j$, the variant status (0 for noncarriers and 1 for variant carriers); and $d_j$, the disease status (0 for controls and 1 for cases) for individual j. The baseline incidence rates $\lambda_0(t)$ were taken from the age-specific background rates for England and Wales (1998-2002) (https://ci5.iarc.fr/CI5I-X/Default.aspx), and the age-specific breast cancer relative risks for pathogenic variant carriers $\beta(t)$ were taken from the recent large-scale BRIDGES (Breast Cancer Risk after Diagnostic Gene Sequencing) project [15]. To allow for possible carrier frequency differences by country, stratified LR calculations were performed within each country and then multiplied to provide a final LR.
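The calculation can be sketched as follows. This is not the authors' implementation (their R scripts and Excel calculators are in the ccLR repository); it is a minimal Python illustration under two stated assumptions. First, the survival integral is discretised by summing annual hazards up to and including age t. Second, the LR uses one rare-variant approximation in which the carrier frequency is profiled out under each hypothesis: the product over carriers of r_j, divided by the K-th power of the mean of r_j over all N individuals, where r_j = f_1(t_j)/f_0(t_j) for cases and S_1(t_j)/S_0(t_j) for controls. That form is derived here for illustration and is not copied from Supplementary File 1. All function names and the toy incidence values are hypothetical.

```python
import math

def survival_and_density(baseline_incidence, log_rr, carrier):
    """Return (S, f) dictionaries keyed by age for genotype k = carrier.

    baseline_incidence: dict age -> annual incidence lambda_0(t)
    log_rr:             dict age -> log relative risk beta(t)
    ASSUMPTION: the integral is approximated by a sum of annual hazards.
    """
    surv, dens = {}, {}
    cumulative_hazard = 0.0
    for age in sorted(baseline_incidence):
        hazard = baseline_incidence[age] * math.exp(carrier * log_rr[age])
        cumulative_hazard += hazard
        surv[age] = math.exp(-cumulative_hazard)
        dens[age] = hazard * surv[age]
    return surv, dens

def case_control_lr(records, baseline_incidence, log_rr):
    """Approximate LR from records of (carrier v_j, case d_j, age t_j).

    ASSUMPTION: rare-variant approximation with the carrier frequency
    profiled out under H_p and H_b (see the lead-in text).
    """
    s0, f0 = survival_and_density(baseline_incidence, log_rr, 0)
    s1, f1 = survival_and_density(baseline_incidence, log_rr, 1)
    log_numerator, ratio_sum, n, k = 0.0, 0.0, 0, 0
    for carrier, case, age in records:
        ratio = f1[age] / f0[age] if case else s1[age] / s0[age]
        ratio_sum += ratio
        n += 1
        if carrier:
            k += 1
            log_numerator += math.log(ratio)
    if n == 0:
        raise ValueError("no individuals supplied")
    return math.exp(log_numerator - k * math.log(ratio_sum / n))

# Toy example: one carrier case among 50 cases and 50 controls, all aged 50.
incidence = {50: 0.002}            # hypothetical annual baseline incidence
beta = {50: math.log(10.0)}        # hypothetical relative risk of 10
records = [(1, 1, 50)] + [(0, 1, 50)] * 49 + [(0, 0, 50)] * 50
print(case_control_lr(records, incidence, beta))
```

For a country-stratified analysis, the same function would be called once per country and the resulting LRs multiplied, as described above.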
In a series of sensitivity analyses, the method was applied using three other published RR estimates: from case series unselected for family history of breast cancer [16], cohort series of BRCA1 and BRCA2 carriers [17], and breast cancer hazard ratio estimates for missense BRCA1 and BRCA2 variants [18]. In order to account for country-specific effects, the stratified analysis was also performed using age- and country-specific incidence rates derived from Cancer Incidence in Five Continents, volume 9 (1998-2002) (https://ci5.iarc.fr/CI5I-X/Default.aspx). Age-specific breast cancer incidences for Greece and North Macedonia were retrieved from the 2020 cancer registry (European Cancer Information System (ECIS), https://ecis.jrc.ec.europa.eu/) since cancer incidence data were not available for the years 1998-2019. Unstratified analyses were also performed for comparison.
Detailed R scripts and preformatted Excel calculators (users can input either individual-level data or data tabulated by age group) for the calculation of case-control LRs are available on GitHub (https://github.com/BiostatUnitCING/ccLR). The files provided can be used to derive estimates based on the RRs from Dorling et al. [15], Kuchenbaecker et al. [17], or Antoniou et al. [16]. In addition, this method can be used to compute case-control LRs for variants in other disease susceptibility genes by using age-specific penetrance estimates for the gene of interest (indicated by "custom" gene in the preformatted Excel calculators and R script). Furthermore, to allow for the possibility that age information is not available (or is only available for a subset of the dataset), the user can incorporate individuals with unknown age at diagnosis or interview into any of the age groups specified in the tabulated calculator.

Odds Ratio Analysis.
Odds ratio analysis was performed using logistic regression adjusted by age and country (if applicable) and Fisher's exact test (corrected using Haldane's method when simulations resulted in zero variant carriers in cases or controls [19]).Logistic regression p values were estimated using the likelihood ratio test.Based on the original ACMG/AMP recommendations [2], an OR estimate greater than 5.0, with the confidence interval not including 1.0, was used to define strong evidence of pathogenicity (PS4).
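To make the PS4 rule concrete, the sketch below computes an odds ratio from a 2 × 2 table, adds 0.5 to every cell when any cell is zero (the Haldane correction), and checks whether the OR exceeds 5.0 with a 95% confidence interval excluding 1.0. It is a simplification: it uses a Wald (Woolf) interval on the log-OR scale rather than the age-adjusted logistic regression and Fisher's exact test used in the analysis, and the function names and counts are hypothetical.

```python
import math

def odds_ratio_with_ci(case_carriers, case_noncarriers,
                       control_carriers, control_noncarriers, z=1.96):
    """Return (OR, lower, upper) with a Wald interval on the log scale."""
    cells = [case_carriers, case_noncarriers,
             control_carriers, control_noncarriers]
    if 0 in cells:
        # Haldane correction: add 0.5 to every cell of the table.
        cells = [c + 0.5 for c in cells]
    a, b, c, d = cells
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(sum(1.0 / x for x in cells))
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

def meets_ps4(odds_ratio, lower, upper):
    """PS4 as described above: OR > 5.0 and the CI does not include 1.0."""
    return odds_ratio > 5.0 and not (lower <= 1.0 <= upper)

# Hypothetical counts: 20 carriers in 20,000 cases, 2 in 20,000 controls.
result = odds_ratio_with_ci(20, 19980, 2, 19998)
print(result, meets_ps4(*result))
```

With very few carriers the interval is wide, so a variant can have a large point estimate and still fail PS4; this is the situation in which the zero-cell correction matters most.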

Evaluation and Application of the Case-Control Analysis Methods.
The simulated datasets were analyzed using the novel case-control LR method, logistic regression (adjusted for age), and Fisher's exact test. The case-control LR method was applied using age-specific breast cancer ORs for BRCA1 and BRCA2 PVs [15]. For causal variants with a relative risk of 2 to 10, the power of the case-control LR method was estimated as the probability of reaching either at least supporting (LR ≥ 2.08) or at least strong (LR ≥ 18.7) pathogenic ACMG/AMP evidence. For benign variants with a relative risk of 1, the power of the case-control LR method was estimated as the probability of reaching either at least supporting (LR ≤ 0.48) or at least strong (LR ≤ 0.053) benign ACMG/AMP evidence. Correspondingly, type I error for pathogenicity was calculated as the probability of obtaining at least supporting or at least strong pathogenic ACMG/AMP evidence when the relative risk was set to 1. Equivalently, type I error for evidence against pathogenicity was calculated as the probability of obtaining at least supporting or at least strong benign ACMG/AMP evidence when the relative risk was greater than 1. The power of the OR methods was estimated as the probability of reaching the ACMG/AMP PS4 criterion (OR > 5.0, CI not including 1.0, p value <0.05). Following the analysis of the simulated datasets, optimal LR cut-offs (chosen to maximize power and minimize type I error) were used to define ACMG/AMP evidence strengths for the 92 variants included in the BCAC OncoArray dataset.
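The thresholds above can be turned into a small lookup, sketched here in Python. The supporting and strong cut-offs (2.08, 18.7, 0.48, 0.053) are those quoted in the text; the moderate and very strong cut-offs (4.33, 350, 0.23, 0.00285) are not stated in this section and are taken from the Bayesian ACMG/AMP framework [3]. The function names are hypothetical, and the sketch applies the generic thresholds only, without any simulation-derived restriction on which strengths are treated as informative.

```python
# Cut-offs ordered from strongest to weakest evidence.
# Moderate and very strong values come from reference [3], not this section.
PATHOGENIC_CUTOFFS = [(350.0, "very strong"), (18.7, "strong"),
                      (4.33, "moderate"), (2.08, "supporting")]
BENIGN_CUTOFFS = [(0.00285, "very strong"), (0.053, "strong"),
                  (0.23, "moderate"), (0.48, "supporting")]

def acmg_strength(lr):
    """Map a likelihood ratio to (direction, strength)."""
    for cutoff, label in PATHOGENIC_CUTOFFS:
        if lr >= cutoff:
            return ("pathogenic", label)
    for cutoff, label in BENIGN_CUTOFFS:
        if lr <= cutoff:
            return ("benign", label)
    return ("uninformative", None)

def fraction_reaching(lrs, threshold, pathogenic=True):
    """Empirical power or type I error: share of replicates past a cut-off."""
    if pathogenic:
        hits = sum(1 for lr in lrs if lr >= threshold)
    else:
        hits = sum(1 for lr in lrs if lr <= threshold)
    return hits / len(lrs)

print(acmg_strength(20.0), acmg_strength(0.3), acmg_strength(1.0))
```

Applied to the LRs from replicates simulated with RR > 1, `fraction_reaching(lrs, 18.7)` estimates power for strong pathogenic evidence; applied to replicates with RR = 1, the same call estimates the corresponding type I error.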

Simulated Datasets.
Based on the simulation results for high-risk BRCA1 (RR > 9) and BRCA2 (RR > 5) variants, LR cut-offs corresponding to strong or very strong evidence in favor of pathogenicity (LR ≥ 18.7) and to at least supporting evidence against pathogenicity (LR ≤ 0.48) should be used in order to maintain high power (>80%) and low type I error (<0.05) (Supplementary Table S2).
Results for all measures in all simulated datasets show that the power to achieve strong evidence in favor of pathogenicity is consistently greater for the case-control LR method using age-specific breast cancer risks than for standard OR analysis methods (Figure 1, Supplementary Table S2). The power to correctly categorize variants with an RR comparable to that of a typical BRCA1 PV was >80% in all scenarios except for small datasets (N ≤ 30,000) with causal variants present at a lower frequency (MAF = 0.00003) (Figure 1(a)).
In addition, the case-control LR method can be used to obtain evidence against pathogenicity, something that cannot be achieved using standard OR analysis methods. Results from simulated case-control datasets of benign variants (RR of 1, Figure 2) show that the case-control LR method using the age-specific RRs of the "average" BRCA1 PV exhibits adequate power (>80%) to identify variants with evidence against pathogenicity (LR ≤ 0.48) for larger datasets (N ≥ 30,000) and a MAF of 0.0001.
Implementing the method for datasets with missing age information, by assuming the same age for all individuals, reduced power and increased type I error in all simulations. However, the type I error was still less than 0.05 in all cases (Supplementary Figures S1 and S2, Supplementary Table S3).
Figure 1: Performance of the case-control likelihood ratio method and odds ratio analysis in providing at least strong ACMG/AMP evidence in favor of pathogenicity (LR ≥ 18.7) using simulated datasets. Power equals the probability of reaching at least strong pathogenic ACMG/AMP evidence. Genotype data simulations were carried out for causal variants conferring a disease relative risk between 2 and 10. We performed 10,000 simulations for each scenario. Results represent simulated case-control data for 20,000 (a-c), 30,000 (d-f), or 50,000 (g-i) breast cancer cases and controls and a minor allele frequency of 0.00003 (a, d, g), 0.00005 (b, e, h), or 0.0001 (c, f, i). ccLR: case-control likelihood ratio; MAF: minor allele frequency; N: sample size.
3.2.2. Case-Control LRs and ACMG/AMP Code Strengths. In the country-stratified baseline analysis (using the breast cancer ORs estimated from BRIDGES [15]), evidence in favor of pathogenicity (defined as LR ≥ 18.70 following the simulation cut-offs) was achieved for 6 variants (6.5%) (Table 2), of which 3 were assigned very strong and 3 strong strength. Evidence against pathogenicity (defined as LR ≤ 0.48) was observed for 59 variants (64.1%), of which 26 were assigned very strong, 14 strong, 7 moderate, and 12 supporting strength. The results for the remaining 27 variants (29.3%) were uninformative. Case-control LRs and corresponding ACMG/AMP code strengths for all 92 BRCA1 and BRCA2 variants are shown in Supplementary Table S4. The different sensitivity analyses did not show any major discrepancies in the estimated LRs (Supplementary Table S5).

Discussion
This study provides a detailed description of the methodology for calculating case-control LRs for rare variants from case-control data, based on age- and gene-specific relative risks and age information for noncarriers. The LRs are calculated by comparing the likelihood of the distribution of the variant of interest in cases and controls under the hypothesis that the variant has similar age-specific relative risks as the "average" pathogenic variant with the likelihood under the hypothesis that it is not associated with increased (or decreased) disease risk. We evaluated the method using simulated datasets and further applied it to derive LRs for pathogenicity for individual variants from the analysis of genotype data from a large case-control study. These can now be used in combination with other evidence to inform variant classification, either according to ACMG/AMP classification standards and guidelines [2,3] or using multifactorial likelihood modelling approaches [4,11]. Further, we provide user-friendly scripts and preformatted Excel calculators to facilitate the future implementation of this method for the calculation of case-control LRs. These resources may be readily applied to calculate LRs for use in the classification of VUS in BRCA1, BRCA2, and other disease susceptibility genes with known penetrance values. Notably, our results demonstrate the improved performance of our LR-based method for assessing variant pathogenicity, as it considers gene- and age-specific penetrance for carriers and age information for noncarriers. Using simulated case-control datasets, we show that the case-control LR method using age-specific breast cancer ORs from high-penetrance genes (e.g., BRCA1 and BRCA2) outperforms other OR analysis methods. These observations reflect the fact that the method presented here is more suitable for the analysis of rare variants in a case-control setting. We further provide LR cut-offs in favor of or against pathogenicity to be used in a real setting.
Analysis of the BCAC OncoArray data using our proposed method provided informative pathogenic ACMG/AMP classification evidence for six of the 92 variants analyzed. Furthermore, 59 variants reached evidence against pathogenicity, something that is not directly measured as a code strength through classical calculations of ORs. Given that, a priori, the vast majority of rare sequence variants (e.g., in BRCA1 and BRCA2) will be neutral with respect to risk, this is a key advantage of our approach. In contrast, using logistic regression analysis, the informative ACMG/AMP classification criterion PS4 (OR > 5.0, p value <0.05, and CI not including 1.0) was reached for only two variants.
There are possible caveats that should be recognized. The selection of cases or controls for a family history of cancer would affect the carrier probabilities. The likelihood ratios would then be inaccurate, but in principle, this could be addressed by incorporating family history into the likelihoods, if known. Depletion of cases with known pathogenic variants by prior clinical sequencing could also bias the likelihood ratios; therefore, the method is best applied to population-based case-control studies. For these reasons, we highlight the ACMG/AMP recommendation to review all available evidence for/against pathogenicity for a given variant and to denote obviously conflicting findings for different evidence types, before assigning a final classification. A conservative approach may be to assign case-control weight with a cap, for example, at moderate strength for or against pathogenicity.
Figure 2: Performance of the case-control likelihood ratio method in providing ACMG/AMP evidence against pathogenicity, using simulated datasets. Power equals the probability of reaching at least supporting benign ACMG/AMP evidence (LR ≤ 0.48) when the relative risk was set to 1. We performed 10,000 simulations for each scenario. Results represent simulated case-control data for 20,000, 30,000, or 50,000 breast cancer cases and controls and a minor allele frequency of 0.00003, 0.00005, or 0.0001. ccLR: case-control likelihood ratio; MAF: minor allele frequency; N: sample size.

Human Mutation
Our method gains power in part because it leverages data on individual-level age, but we have to acknowledge that age is not always available.The method can be implemented more approximately by assuming that individuals with unknown information are of the same age, but this reduces power because the expectation that carriers of risk variants develop the disease at a younger age is then not utilised.It may also increase type I error because the likelihood ratio may be calculated for an age that is not appropriate for the dataset (for example, if the dataset consists predominantly of older individuals), although the type I error was still low in the simulations we considered.In the tabulated, preformatted calculator, we allow the user to incorporate individuals of unknown age at diagnosis or interview into any of the age groups specified.A conservative approach would be to include individuals of unknown age in the oldest age group.In this way, case-control genotypes from both existing data and new series, with and without age data, can be incorporated.However, we would like to emphasize that pooling series, particularly from different populations with different age/ethnicity structures or with different genotyping technologies, can lead to biased results.Ideally, datasets should be analysed separately, and the overall likelihood ratio generated by multiplying the study-specific likelihood ratios.
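Two of the practical recommendations above, placing individuals of unknown age in the oldest age group and multiplying study-specific likelihood ratios, can be sketched as follows. The data layout (age groups keyed by their lower and upper bounds) and the function names are hypothetical and do not reflect the structure of the Excel calculators.

```python
import math

def fold_unknown_into_oldest(counts_by_age_group, unknown_count):
    """Add individuals of unknown age to the oldest age group.

    counts_by_age_group: dict mapping (low, high) age bounds to counts.
    Returns a new dict; the input is left unchanged.
    """
    oldest = max(counts_by_age_group)  # tuple order picks the highest bounds
    updated = dict(counts_by_age_group)
    updated[oldest] += unknown_count
    return updated

def combine_study_lrs(study_lrs):
    """Multiply study-specific LRs (summed on the log scale for stability)."""
    return math.exp(sum(math.log(lr) for lr in study_lrs))

# Hypothetical tabulated carrier counts and study-specific LRs.
carriers = {(21, 30): 5, (31, 40): 7, (71, 80): 2}
print(fold_unknown_into_oldest(carriers, 3))
print(combine_study_lrs([2.0, 5.0, 0.5]))
```

Multiplying study-specific LRs assumes that the studies are independent; it avoids pooling individuals from populations with different age structures or genotyping technologies into a single calculation.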

Conclusions
This manuscript describes in detail a novel method for calculating the case-control LR to provide evidence of variant pathogenicity. This LR method is more informative than logistic regression analysis (or an OR calculation based on contingency tables and Fisher's exact test). It improves power because it considers age- and gene-specific penetrance values and age information for noncarriers, and it can provide evidence both in favor of and against pathogenicity. In addition, this method can be applied to the classification of VUS in any disease susceptibility gene for which disease penetrance has been reliably estimated. Open-access scripts and preformatted Excel calculators, with code and instructions on how to use the method, are available at https://github.com/BiostatUnitCING/ccLR.

Ethical Approval
This research has been approved by the Cyprus National Bioethics Committee. All participating studies were approved by the relevant ethics committees, and informed consent was obtained from study participants [12]. For NHS and NHS2, the study protocol was approved by the institutional review boards of the Brigham and Women's Hospital and Harvard T.H. Chan School of Public Health, as well as those of participating registries as required. The ethical approval for the POSH study is MREC/00/6/69, UKCRN ID: 1137.

Disclosure
The EU Horizon 2020 Research and Innovation Programme funding source had no role in study design, data collection, data analysis, data interpretation, or writing of the report. The content of this manuscript does not necessarily reflect the views or policies of the National Cancer Institute or any of the collaborating centers in the Breast Cancer Family Registry (BCFR), nor does mention of trade names, commercial products, or organizations imply endorsement by the USA Government or the BCFR. J.L.H. is a National Health and Medical Research Council (NHMRC) Senior Principal Research Fellow. M.C.S. is an NHMRC Senior Research Fellow. The content of this manuscript does not necessarily reflect the views or policies of the National

Table 2: Variants with informative LRs in favor of pathogenicity, estimated by the baseline analysis.