The Importance of Cancer Registry Linkage for Studying Rare Cancers in Prospective Cohorts

Large prospective cohort studies may offer an opportunity to study the etiology and natural history of rare cancers. Cancer diagnoses in observational cohort studies are often self-reported, yet little information exists on the validity of self-reported cancer diagnoses, especially of rare cancers, in Canada. This study evaluated the validity of self-reported cancer diagnosis in Alberta's Tomorrow Project (ATP), a provincial cohort in Canada. ATP data were linked to the Alberta Cancer Registry (ACR), and the first instance of self-reported cancer in a follow-up survey was compared to the first cancer diagnosis in the ACR after enrollment. Sensitivity and positive predictive value (PPV) were estimated for the reporting of cancer status, of common or rare cancer, and of site-specific cancer. Logistic regression explored factors associated with false positive, false negative, and incorrect cancer site reporting. Among the 30,843 ATP participants who consented to registry linkage, there were 810 primary cancer diagnoses in the ACR and 959 self-reports of a first cancer post-enrollment, for a cancer status sensitivity of 92.1% (95% CI: 90.0-93.9) and PPV of 77.8% (95% CI: 75.0-80.4). Compared to common cancers, rare cancers had lower sensitivity (62.8% vs. 89.6%) and PPV (35.8% vs. 84.5%). Participants with a rare cancer were more likely to report an incorrect site than those with a common cancer, and rare cancers were less likely to be captured by active follow-up. While rare cancer research may be feasible in large cohort studies, registry linkage is necessary to capture rare cancer diagnoses completely and accurately.


Introduction
Rare cancers account for approximately 25% of cancer cases in Canada and contribute disproportionately to cancer-related morbidity and mortality [1,2]. However, several challenges, including small sample sizes, diagnostic uncertainty, a lack of knowledge and expertise, and high costs, impede progress in the study of rare cancers with conventional clinical research designs [3][4][5]. Large observational cohort studies, such as the Canadian Partnership for Tomorrow's Health (CanPath), offer an opportunity to study the etiology and natural history of rare diseases with a sufficiently large sample size [5,6]. CanPath is a collaborative effort among six regional cohorts covering 9 of the 10 Canadian provinces; it is Canada's largest volunteer research participant cohort, with over 330,000 participants enrolled to date [7]. Information in large observational cohort studies is often self-reported, and self-reported cancer status must be valid to produce useful results. If self-reported data are deemed valid, a cohort may be used for etiologic research in the absence of cancer registry linkage; if not, linkage to population-based registries is a necessary step to utilize cohort data for rare cancer research.
Most relevant in evaluating the validity of self-reported cancer diagnosis are the sensitivity and positive predictive value (PPV). Both a low sensitivity (many false negatives) and a low PPV (many false positives) imply a high likelihood of disease status misclassification and can bias the results of a study. Specificity and negative predictive value (NPV) are consistently very high in cancer self-report validation studies (both >90%) [8] because cancer prevalence is low: most people do not get cancer, so true negatives greatly outnumber false positives and false negatives.
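To make the prevalence argument concrete, the following sketch uses hypothetical two-by-two counts chosen only to be roughly at this cohort's scale (not the study's actual table) and shows that specificity and NPV remain near 1 even when false reports are fairly common:

```python
# Illustrative only: hypothetical counts at roughly this cohort's scale
# (~30,000 participants, ~800 cancers), not the study's actual results.
tp, fn = 750, 60            # hypothetical correctly reported / missed cancers
fp = 200                    # hypothetical false cancer reports
tn = 30_000 - tp - fn - fp  # everyone else: true negatives

sensitivity = tp / (tp + fn)
ppv = tp / (tp + fp)
specificity = tn / (tn + fp)
npv = tn / (tn + fn)

print(f"sensitivity={sensitivity:.3f}, PPV={ppv:.3f}")
print(f"specificity={specificity:.3f}, NPV={npv:.3f}")  # both remain near 1
```

Even with 200 false positives, specificity and NPV exceed 0.99 here simply because the true negatives number in the tens of thousands, which is why validation studies focus on sensitivity and PPV instead.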
To inform rare cancer researchers considering the CanPath cohort, we conducted a study to investigate the sensitivity and PPV of self-reported primary cancer diagnoses and the factors associated with self-report validity. Cross-provincial data sharing agreements currently restrict researchers' ability to link CanPath to the Canadian cancer registry [7,15]. Our study was therefore confined to data from one of the six regional cohorts that make up CanPath, Alberta's Tomorrow Project (ATP), where the authors are based.

Materials and Methods

2.1. Study Population. Alberta's Tomorrow Project (ATP) is a cohort established in the Canadian province of Alberta, enrolling volunteers between the ages of 35 and 69 with no history of cancer except nonmelanoma skin cancer [16]. Participants for this study were recruited in Phase 1 of ATP's recruitment, from 2000 to 2008, via random digit dialing [17]. A total of 31,203 participants completed the baseline Health and Lifestyle Questionnaire (HLQ) at enrolment; 360 participants who did not consent to data linkage (i.e., did not provide a personal health care number) were excluded from this study.

2.2. Data Sources. Data were obtained from ATP and the Alberta Cancer Registry (ACR). Self-reported cancer diagnosis was taken from the first ATP follow-up survey in which it was reported: Survey 2004 (S04), Survey 2008 (S08), the Updated Health and Lifestyle Questionnaire 2009-2011 (UHLQ), or CORE 2011-2015 (CORE) [16]. These surveys also collected personal and lifestyle information, such as education, smoking status, family health history, place of birth, and other health factors [16].
Alberta Cancer Registry data were linked to ATP data before the data were provided to the study team. ACR and ATP records were matched on personal health care number and confirmed on first name, last name, and date of birth [16]; records were linked only when a perfect match was found. The ACR is a population-based registry that records topography, morphology, and behavior using ICD-O-3, and it has achieved gold certification from the North American Association of Central Cancer Registries (NAACCR) for complete, accurate, and timely data for many years [18]. Cancer diagnoses that are not mandated to be reported to the Canadian Cancer Registry (CCR), such as in situ (behavior code 2) cervical and prostate cancers and nonmelanoma skin cancers [19], were excluded to facilitate using the results to explore the utility of linkage to the CCR. Diagnoses with behavior code 0 (benign) or 1 (borderline) were also excluded, as these are generally not considered "cancers"; in an exploratory analysis, benign and borderline tumors accounted for less than 3% of all registry diagnoses, and less than 40% of them were self-reported. Ethics approval was obtained from the Health Research Ethics Board of Alberta (Study ID CC-16-0880).
2.3. Cancer Classifications. ACR cancer site was derived from ICD-O-3 topography codes using the cancer site categories of the Surveillance, Epidemiology, and End Results (SEER) Program 2018 classification scheme [20]. Categories that we did not expect to be differentiated in self-reports were collapsed: (1) corpus uteri and uterus, NOS (not otherwise specified) were collapsed into a single "Uterus" category; and (2) oropharynx, nasopharynx, hypopharynx, and pharynx were collapsed into a single "Throat" category. Only ACR diagnoses that occurred within an individual's follow-up time were included, as only these diagnoses would have had the opportunity to be reported in a survey.
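As an illustration (hypothetical code, not ATP's actual pipeline; the exact label strings are assumptions), the collapsing rules above amount to a simple lookup that maps fine-grained SEER site labels to the analysis categories:

```python
# Sketch of the category-collapsing step described in the text.
# Label strings are assumed for illustration; they are not the study's codes.
COLLAPSE = {
    "Corpus uteri": "Uterus",
    "Uterus, NOS": "Uterus",
    "Oropharynx": "Throat",
    "Nasopharynx": "Throat",
    "Hypopharynx": "Throat",
    "Pharynx": "Throat",
}

def collapse_site(seer_site: str) -> str:
    """Map a SEER 2018 site label to its collapsed analysis category;
    sites outside the collapsing rules pass through unchanged."""
    return COLLAPSE.get(seer_site, seer_site)

print(collapse_site("Nasopharynx"))  # Throat
print(collapse_site("Breast"))       # Breast (unchanged)
```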
The S04, S08, and UHLQ surveys asked participants to record the cancer type in open text. The CORE survey had a dropdown menu with a list of cancers corresponding to the SEER site categories and an open text option for other types. Skin cancer responses that did not specify "melanoma" were considered nonmelanoma skin cancer and were not included as self-reported cancer diagnoses. The first instance of self-reported cancer type was categorized into an appropriate site category from the SEER 2018 scheme; SEER cancer categories and corresponding self-report sites are described in Supplementary Table S1. Two analysts (EM and TS) independently categorized the first instance of self-reported cancer type from each participant into a SEER 2018 category, with 99.3% (95% CI: 98.9, 99.6%) agreement. Disagreements on cancer type were resolved by consensus.
Through ACR linkage, we identified 118 participants who had a cancer diagnosis prior to their enrollment in ATP. Given ATP's intention to enroll only those without a history of cancer, analyses were carried out with and without these 118 participants to assess whether they affected the self-report validity of incident cancer diagnoses.
2.4. Measures of Accuracy. Rare sites were defined as those with an age-standardized incidence rate <15/100,000/year in Canada, based solely on site, in a recent analysis [1]. Colon cancers, blood/hematopoietic/bone marrow cancers, and lymphatic cancers were each grouped together for this analysis; these three site categories were common. Cancer diagnoses in the ACR were used as the gold standard for estimating sensitivity and PPV. Sensitivity was defined as the number of true positives (TP) divided by the sum of true positives and false negatives (FN). PPV was defined as the number of true positives divided by the sum of true positives and false positives (FP). Sensitivity and PPV were estimated for any-cancer (overall) diagnosis, for common or rare cancer site, and site-specifically for each cancer site. The definitions of terms for each analysis can be found in Supplementary Table S2.

2.5. Data Analysis. Logistic regression was used to examine the factors associated with three outcomes: (1) incorrectly reporting a cancer diagnosis (FP), (2) failing to report a cancer diagnosis (FN), and (3) failing to report the correct cancer site. Covariates included sex, education, smoking status, family history, place of birth, and age (at cancer report for the first and third outcomes and at last follow-up for the second outcome). ACR cancer diagnosis (common or rare) was also included as a covariate for the second and third outcomes. The most recently reported smoking status and family history prior to or at the time of cancer reporting were used. Place of birth was not collected at baseline and so was investigated only in those who took S08, UHLQ, or CORE and reported this information. Covariates with a p value <0.2 in univariate analyses were included in the multivariable analysis for each outcome, and multivariable results were reported. All analyses were conducted using Stata IC version 15 [21].
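The two definitions can be checked against the overall counts reported in this study (746 true positives, 810 ACR diagnoses, and 959 self-reports); the short sketch below reproduces the point estimates quoted in the abstract:

```python
# Sensitivity and PPV as defined in the Measures of Accuracy section,
# evaluated with the cohort-level counts reported in this study.
def sensitivity(tp: int, fn: int) -> float:
    """TP / (TP + FN): fraction of registry diagnoses that were self-reported."""
    return tp / (tp + fn)

def ppv(tp: int, fp: int) -> float:
    """TP / (TP + FP): fraction of self-reports confirmed in the registry."""
    return tp / (tp + fp)

tp = 746        # reported a cancer and had one in the ACR
fn = 810 - tp   # 64 registry diagnoses never self-reported
fp = 959 - tp   # 213 self-reports with no matching ACR diagnosis

print(f"sensitivity = {100 * sensitivity(tp, fn):.1f}%")  # 92.1%
print(f"PPV         = {100 * ppv(tp, fp):.1f}%")          # 77.8%
```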

Results

3.1. Sensitivity and PPV of Self-Reported Cancer Diagnoses. Among the 30,843 ATP participants who consented to registry linkage, there were 3,187 primary cancer diagnoses in the ACR by 2018, of which 510 were rare cancers as defined in this report. Of these, 810 diagnoses occurred during active participant follow-up time and were included in this analysis. The large difference between ACR cancer cases and cases included in the study is mainly due to loss to follow-up. Table 1 summarizes the number of participants who completed each follow-up survey; for example, only 25% of participants responded to the CORE survey. Most participants with an incident cancer did not complete any follow-up survey after their diagnosis; they were thus lost to follow-up, and their diagnoses could not be used in this validation study.
Of the 746 participants who reported a cancer and had a cancer in the ACR (true positives), 90.5% reported the correct site, 88.1% reported the correct site and a year of diagnosis within one year of the registry date, and 68.2% reported the correct site and the correct year of diagnosis (Table 3). Common cancers were reported more accurately overall: of 649 true positives, 97.5% reported the correct site, 95.1% the correct site and year within one year, and 73.3% the correct site and year. These percentages remained relatively unchanged when skin cancer was excluded. Rare cancers were reported less accurately: of 54 true positives, 77.8% reported the correct site, 74.1% the correct site and year within one year, and 61.1% the correct site and year. Removing cervix cancer had little effect on these percentages.

3.2. Sensitivity and PPV Excluding Participants with a Diagnosis of Cancer before Baseline. After excluding participants who had a cancer diagnosis prior to enrolment, there were 741 true positives, 198 false positives, and 48 false negatives (Table 2). Of those who correctly reported that they had cancer, 90.4% also reported the correct site, 88.0% reported the correct site and a year of diagnosis within one year, and 68.3% reported the correct site and year (Table 3). The site and year of diagnosis of common cancers were reported more accurately than those of rare cancers. Site-specific sensitivities and PPVs slightly improved or remained unchanged when excluding those with cancer before baseline, with one exception: the PPV of blood/hematopoietic cancers decreased slightly (Table 2).

Notes to Table 2: (a) Cancer site groupings are described in Supplementary Table S1; only SEER 2018 cancer sites with >10 ACR diagnoses and/or >10 self-reported diagnoses are reported. (b) A participant had a diagnosis before baseline if their age at first cancer diagnosis in the ACR was before their age at baseline. (c) Number of diagnoses in the ACR; a "-" indicates there were <10. Common and rare diagnoses add up to the overall total; bolded cancer site types do not, as the "Other" and "CNS/eye" groups are not included. (d) Sensitivities are reported for groups with 10 or more ACR diagnoses; a "-" indicates there were <10 ACR diagnoses. (e) PPVs are reported for groups with 10 or more self-reported diagnoses; a "-" indicates there were <10 self-reported diagnoses. (f) Includes the SEER 2018 site categories of unknown, ill-defined, bones and joints, connective and soft tissue, and retroperitoneum and peritoneum.

3.3. Factors Associated with Incorrect Cancer Status or Site Reporting. Of the 30,725 ATP participants who consented to registry linkage and had no history of cancer at baseline, there were 741 true positives, 198 false positives, and 48 false negatives for reporting a cancer diagnosis. Of the 741 true positives, 71 reported an incorrect cancer site. Predictors of false positive (vs. true positive) reporting and of incorrect (vs. correct) site reporting are presented in Table 4, adjusted for predictors significant in univariate analysis (p < 0.2). Education, family history, and sex were not significant predictors of false positive, false negative, or incorrect site reporting in univariate analysis (p > 0.2). For false negative compared to true positive reporting (N = 789), only smoking status was significant at p < 0.2 in univariate analysis. Former smokers had higher odds of not reporting a diagnosed cancer than nonsmokers (OR (95% CI): 1.92 (0.99-3.72), p = 0.053). Current smokers also had higher odds of not reporting cancer than nonsmokers (OR (95% CI): 1.32 (0.49-3.51)), but this was not statistically significant (p = 0.585).
Older participants had higher odds of correctly reporting cancer than younger participants but also had higher odds of incorrectly reporting the cancer site (Table 4). Participants >70 years of age at the time of report had the highest odds of incorrectly reporting the cancer site compared to those <50 years, adjusting for smoking status and rarity of cancer site (OR (95% CI): 4.19 (1.29-13.6)). Smoking status was associated with incorrectly self-reporting cancer (Table 4): compared to nonsmokers, former smokers had higher odds of reporting a nondiagnosed cancer (OR (95% CI): 1.59 (1.10-2.29)), as did current smokers (OR (95% CI): 2.05 (1.29-3.24)). Smoking was not associated with incorrect site reporting. Finally, participants with a rare cancer had much higher odds of incorrectly reporting the cancer site than those with a common cancer, adjusting for age and smoking status (OR (95% CI): 13.7 (7.60-24.5)) (Table 4).
Among participants who reported place of birth, those born outside of Canada had slightly lower odds of reporting a cancer that was not in the registry (OR (95% CI): 0.75 (0.46-1.23)) and of not reporting a diagnosed cancer (OR (95% CI): 0.85 (0.35-2.06)) than those born in Canada (Table 4), and slightly higher odds of reporting the cancer site incorrectly (OR (95% CI): 1.36 (0.70-2.64)). However, none of the associations between place of birth and self-report validity were statistically significant.

Discussion
This study evaluated whether self-reported cancer diagnoses are a valid outcome measure among participants of a Canadian cohort study and compared the reporting of common and rare cancers. It contributes to the limited information on the validity of self-reported cancer diagnosis in the Canadian population. The sensitivity and PPV for reporting overall cancer status, without considering site, were similar to reports from the US and Australia [9,12,14]. PPV was lower than sensitivity; self-report was more likely to misclassify someone as having cancer when they did not (false positive) than to misclassify someone who had cancer as not having cancer (false negative). Those who correctly reported that they had cancer were also likely to report the correct cancer site; however, the year of diagnosis was reported less accurately. This was also demonstrated in a US cohort by Bergmann et al., which found that 84% of overall true positives also reported the correct site and the correct year of diagnosis within one year [12].

Notes to Table 3: ATP: Alberta's Tomorrow Project; TP: true positive; excl.: excluding. (a) A participant had a diagnosis before baseline if their age at first cancer diagnosis in the ACR was before their age at baseline. (b) Overall TP does not equal common TP plus rare TP: an overall TP reported cancer and had cancer in the ACR, regardless of type, whereas a participant with a common cancer in the ACR had to report a common cancer to be a common TP; a similar criterion defines a rare cancer TP.
Common cancers were reported more accurately overall than rare cancers and, as expected, made up the majority of cancer cases in the cohort. Breast and prostate cancer, the two most common cancers in this cohort, had the highest sensitivity and PPV; these two cancers often show high self-report validity across the literature [9][10][11][12]. Rare cancers, however, had lower sensitivity than common cancers and were less likely to be captured by self-report. Participants with a rare cancer were more likely to report an incorrect site than participants with a common cancer, suggesting that rare cancer diagnoses are not well understood by participants. The logistic regression analysis supported this hypothesis: among those who correctly reported overall cancer status, the odds of reporting the site correctly were much higher for participants with a common cancer than for those with a rare cancer. A possible explanation is that rare cancers often have less diagnostic precision [4,5]; ambiguous diagnostic procedures or results may be more likely to result in an incorrect or absent self-report [10,11]. Given the low sensitivity and low PPV of rare cancer sites reported here, self-reports of rare cancer are unlikely to be a valid outcome measure, and cancer registry linkage is necessary to capture these cases accurately. Registry linkage also provides more specific diagnostic information; cancer research often requires narrower site categories than those used in this analysis, or further information such as histology and/or cancer stage.
Cancer registry linkage not only provides a valid diagnosis but also serves as passive follow-up to capture cases completely. There were 3,187 total cancer diagnoses in the ACR among our study cohort, but only 810 occurred within active follow-up in the ATP and thus had the opportunity to be self-reported in a subsequent survey. Most ATP participants with a cancer diagnosis did not have the opportunity to report it in an ATP follow-up survey (i.e., the diagnosis occurred after the participant's last survey was completed).
Rare cancers accounted for approximately 16% of the total cases that developed in the cohort but only 10.6% of the cases within active follow-up. Although participants who developed either common or rare cancers were lost to active follow-up, those who developed a rare cancer were more likely to be lost. One possible explanation for this differential loss to follow-up is that participants diagnosed with a rare cancer may have a shorter survival time than those diagnosed with a common cancer and thus be less likely to report. In this cohort, 40% of participants with a rare cancer had died, compared with 20% of participants with a common cancer; among those who died, the median survival time was 3.7 years for rare cancers and 6 years for common cancers.
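The differential capture can be sketched with back-of-the-envelope arithmetic from the figures quoted above; the rare within-follow-up count (~86) is inferred here from the reported 10.6% and is an assumption, not a number stated in the study:

```python
# Back-of-the-envelope check of differential capture by active follow-up,
# derived from figures quoted in the text. The within-follow-up rare count
# is inferred from the reported 10.6% share and is approximate.
total_acr, rare_acr = 3187, 510    # all ACR diagnoses; rare subset
in_followup = 810                  # diagnoses within active follow-up
rare_in_followup = round(0.106 * in_followup)        # ~86 rare cases
common_in_followup = in_followup - rare_in_followup  # ~724 common cases

rare_capture = rare_in_followup / rare_acr
common_capture = common_in_followup / (total_acr - rare_acr)
print(f"rare cancers captured by active follow-up:   {rare_capture:.1%}")
print(f"common cancers captured by active follow-up: {common_capture:.1%}")
```

Under these assumptions, roughly a sixth of rare diagnoses fell within active follow-up versus roughly a quarter of common diagnoses, consistent with the differential loss described above.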
Finally, relying on self-reported cancer history for inclusion criteria assumes that participants will correctly state that they have no history of cancer at study entry. If participants who are not cancer-free at baseline are nonetheless included in an etiologic study, its results may be biased. It is unclear why some participants did not report their cancer history at baseline; one reason may be that participants did not disclose a previous cancer in order to enroll, as they were aware that being cancer-free at baseline was an eligibility requirement [17]. ATP uses the cancer registry to verify cancer history after enrollment and indicates whether a participant had a cancer before baseline. However, the site of the prior cancer(s) is not disclosed, and whether a cancer self-report in a subsequent survey refers to an incident or a previous diagnosis is not clear. Linkage to the cancer registry provides more information on these individuals and their diagnoses for researchers whose results may be affected by cancer diagnoses before baseline.
Cancer registry linkage improved the utility of the ATP cohort by providing valid, detailed, and complete cancer diagnosis data. Through the partnership of ATP and the five other regional cohorts, CanPath allows for a larger sample size and further exploration of rare cancers. Although most CanPath participants have consented to linkage with a cancer registry, these agreements are made within the regional cohorts [7], and administrative data cannot cross provincial boundaries without further data agreements [15]. Therefore, nationally linked data can be obtained only by applying separately for data access and registry linkage to each of the six regional cohorts. Easing access to nationally linked data would allow better use of the potential CanPath offers for the study of rare cancers. Alternatively, allowing regional cohorts to pass validated, and perhaps more detailed, cancer diagnosis data along to CanPath would limit the reliance on self-reported cancer diagnosis.
Using the ACR as a gold standard strengthens this analysis because of its demonstrated completeness and accuracy [18], though the possibility remains that a true cancer case was not recorded in the ACR, resulting in an apparent false positive. There are several other limitations. First, a lack of diagnoses within active follow-up did not allow separate reporting of some individual sites (e.g., stomach, small intestine, liver), though general anatomical groupings were still reported (e.g., digestive/hepatic). Second, excluding nonmelanoma skin cancer and in situ (behavior code 2) cervical and prostate cancer, as per CCR reporting guidelines, likely affected the estimated sensitivity and/or PPV of these sites. Self-reports of skin cancer were included in the analysis only if "melanoma" was specified, which likely underestimated true self-reported melanoma cases. Cervical cancer had a very low PPV: there were few (<10) ACR diagnoses of invasive (behavior code 3) cervical cancer but many self-reports (false positives). These self-reports were likely due to the large number of in situ cervical cancer cases in this cohort (ACR, n = 455), many of which occurred before baseline. Finally, researchers should use caution in generalizing these results to the Canadian population: participants who enroll and are followed up in this cohort are likely healthier or more health conscious than the general population [17], which could introduce a "healthy volunteer effect" [22]. However, these results are likely generalizable to other large observational cohorts with similar aims in health research.

Conclusions
While self-reported diagnosis is reasonably valid for some common cancer types, other cancer types, particularly rare cancers, require registry linkage to be captured completely and accurately. To minimize bias and loss to follow-up in the use of cohort data for rare cancer research, linkage to a cancer registry is necessary. Efforts to remove barriers to cross-provincial data sharing in Canada are ongoing and are needed to allow researchers to conduct the valuable rare cancer research that national cohorts and registries make possible.

Data Availability
Data can be made available upon request and in compliance with the ATP's Disclosure Policy. Information on accessing ATP's data can be found at https://myatpresearch.ca/.

Disclosure
The views expressed herein represent the views of the author(s) and not of Alberta Health Services, Health Canada, or any other of Alberta's Tomorrow Project's funders.