The Impact of Diagnostic Code Misclassification on Optimizing the Experimental Design of Genetic Association Studies

Diagnostic codes within electronic health record systems can vary widely in accuracy. It has been noted that the accuracy of disease phenotype classification increases monotonically with the number of instances of a particular diagnostic code. As a growing number of health system databases become linked with genomic data, it is critically important to understand the effect of this misclassification on the power of genetic association studies. Here, I investigate the impact of diagnostic code misclassification on the power of genetic association studies, with the aim of better informing experimental designs that use health informatics data. The trade-off between (i) reduced misclassification rates from requiring additional instances of a diagnostic code per individual and (ii) the resulting smaller sample size is explored, and general rules are presented to improve experimental designs.


Introduction
A wealth of important clinical information is contained within large electronic health record (EHR) systems. Such information can be an invaluable resource for measuring disease prevalence [1] and disease comorbidity [2], the association between birth month and disease susceptibility [3], the prediction of outcomes [4], the measurement of economic impact of health care [5], and the discovery of etiological factors [6]. A key feature of these data is the diagnostic codes assigned by medical professionals to patient records. However, the accuracy of inferring disease phenotypes from electronic diagnostic codes can vary widely across diseases and is often subject to high degrees of error [7][8][9][10]. These studies have noted substantial misclassification effects from the use of electronic diagnostic code data, sufficient to undermine experiments that define cases and controls by International Classification of Diseases (ICD) codes alone. The ICD coding system is instituted by the World Health Organization and has been adopted in the United States by the National Center for Health Statistics. More sophisticated approaches to disease classification, such as those using a variety of EHR data and machine learning methods, are difficult to generalize across all diseases and to implement in a high-throughput manner. That said, I anticipate that machine learning methods that predict phenotypes using EHR variables as features will eventually supplant the sole use of ICD code data. Until that time, ICD data may still have utility in initial screens, to be subsequently validated through methods with higher positive and negative predictive values.

Related Work
In a general setting, the effect of phenotypic misclassification on the statistical power of genetic association studies has been previously explored [11][12][13][14]. Edwards and colleagues characterized the noncentrality parameter in asymptotic power distributions given the presence of phenotypic misclassification [11]. The authors used cost functions to capture the effect of misclassification and showed that, as the disease prevalence becomes small, the cost of misclassifying a control as a case becomes large and the cost of misclassifying a case as a control becomes small. Ji et al. similarly investigated the calculation of a noncentrality parameter capturing phenotype errors for subsequent use in a likelihood ratio test for genetic association studies [12]. Later, Gordon and colleagues showed how to incorporate misclassification error rates into a trend test for genetic association in case/control studies [13]. More recently, Manchia and colleagues investigated the impact of heterogeneity within a clinical phenotype on genetic association [14].
Considering ICD data with misclassification, the type I and type II error rates for genomic association studies were recently explored thoroughly by Duan et al. [15]. The Duan et al. study found little inflation in false-positive rates but considerable false-negative rates under certain allele frequency, effect size, and disease prevalence parameters. In the context of initial screens of ICD codes in EHR systems, several studies have investigated the relationship between the number of instances of particular ICD codes and measures of diagnostic utility [1][16][17][18]. In general, the accuracy of diagnoses improves with the number of instances of the code; however, this comes at the expense of smaller sample sizes and more false negatives. Hence, the number of ICD code instances used to define a disease governs a trade-off between type I and type II error rates. In this work, I investigate this trade-off and provide a framework for determining highly powered EHR-based experimental designs using diseases defined by different numbers of instances of ICD codes.

Materials and Methods
For a large genetic association scan using ICD data, define a simple disease classification scheme such that cases are those individuals with at least x instances of a particular ICD code. Consider a design where individuals with an ambiguous number of instances, i, of the code (i.e., 0 < i < x) are excluded from the analysis. Further consider a comparison of well-defined cases (i.e., those with at least x instances) against a large, fixed set of controls. With regard to the genetics, restrict the methods to biallelic markers with minor alleles segregating in the population at a frequency of at least 1%, that is, single-nucleotide polymorphisms (SNPs). Define the alleles at a SNP contributing to the susceptibility of the disease as $A_1$ and $A_2$. Let the relative risk of the minor allele, $A_2$, be R, such that $R = P(A_2 \mid \text{cases}) \, [P(A_2 \mid \text{controls})]^{-1}$. Let the frequency of $A_2$ in the general population be q. Accordingly, 1 − q is the frequency of $A_1$. Define $n_x$ as the number of cases obtained from the definition of having at least x instances of the ICD code being evaluated. Set the number of controls as m, such that $m \gg n_x$. Assume that the $A_2$ frequency in controls is approximately q. Model the decrease in the misclassification proportion within cases as x increases with a monotonic function $f(x)$, such that the expected number of truly positive cases is $n_x [1 - f(x)]$. The form of $f(x)$ may vary considerably for different ICD codes. Lastly, let α be the statistical threshold for determining a positive finding, that is, a finding is declared positive when the p value < α. The statistical test of genetic association considered is the binomial test of proportions, which evaluates the null hypothesis of no correlation between the frequency of $A_2$ and disease status.
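The case definition above can be illustrated with a minimal Python sketch. The function name, the individual IDs, and the example counts are hypothetical. Treating individuals with zero instances of the code as the control pool is an assumption made here for illustration; the text specifies only that individuals with 0 < i < x are excluded and that cases are compared against a large, fixed set of controls.

```python
from typing import Dict, List, Tuple


def classify_by_code_count(
    code_counts: Dict[str, int], x: int
) -> Tuple[List[str], List[str], List[str]]:
    """Split individuals into (cases, controls, excluded) for threshold x.

    code_counts maps a hypothetical individual ID to the number of
    instances, i, of one ICD code in that individual's record.
    """
    if x < 1:
        raise ValueError("x must be a positive integer")
    cases: List[str] = []
    controls: List[str] = []
    excluded: List[str] = []
    for person_id, i in code_counts.items():
        if i >= x:
            cases.append(person_id)      # well-defined case: at least x instances
        elif i == 0:
            controls.append(person_id)   # ASSUMPTION: zero instances -> control pool
        else:
            excluded.append(person_id)   # ambiguous: 0 < i < x
    return cases, controls, excluded


# Hypothetical example records
counts = {"p1": 0, "p2": 1, "p3": 3, "p4": 5}
cases, controls, excluded = classify_by_code_count(counts, x=3)
print("cases:", cases, "controls:", controls, "excluded:", excluded)
```

Raising x moves individuals from the case list to the excluded list, which is the loss of sample size that the power calculation has to weigh against the gain in case accuracy.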
Statistical power will be used to evaluate the impact of increasing x on the resulting experimental design. Under the model specified above, the power to detect association at an autosomal SNP, 1 − β, is calculated by an approximation (1) in which Φ is the standard Gaussian cumulative distribution function, z is the inverse standard Gaussian score, $N = 4 n_x m/(n_x + m)$, and q and s are the $A_2$ frequencies in controls and cases, respectively. The expected frequency of $A_2$ within cases under the misclassification model is obtained using Bayes' theorem (2). To model the decrease in the misclassification rate with increasing numbers of ICD code instances, consider a simple decay function for $f(x)$ (3), in which δ is a parameter that can be estimated for each ICD code. Similarly, consider a form for $n_x$ as a function of $n_{x=1}$ (4) that models the reduction in the number of cases defined by using increasing numbers of instances of an ICD code. Here, ε is the parameter that captures the rate of decline in case numbers as the definition of case status becomes more stringent with the use of larger numbers of ICD code instances; it can also be estimated for each ICD code. The machinery is now in place for the calculation of statistical power to detect disease association at a genetic marker using data from linked ICD coding systems.
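The following Python sketch shows one way the pieces of the model could be assembled into a power calculation. It is a sketch under stated assumptions, not the author's exact formulas. Only N = 4 n_x m/(n_x + m), the roles of Φ, z, q, and s, and the definition of R come from the text. The exponential forms for f(x) and n_x, the mixture expression for s, and the two-proportion z-test approximation with a pooled variance are illustrative assumptions, and all function names are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard Gaussian: provides Phi and its inverse


def misclassification(x: float, delta: float) -> float:
    """ASSUMED decay form for f(x): exp(-delta * x)."""
    return math.exp(-delta * x)


def case_count(x: float, n1: float, eps: float) -> float:
    """ASSUMED form for n_x: n_{x=1} * exp(-eps * (x - 1))."""
    return n1 * math.exp(-eps * (x - 1.0))


def case_allele_freq(x: float, R: float, q: float, delta: float) -> float:
    """ASSUMED mixture for s: true cases carry A2 at frequency R*q,
    misclassified cases carry it at the control frequency q."""
    f = misclassification(x, delta)
    return (1.0 - f) * R * q + f * q


def power(x, n1, m, R, q, delta, eps, alpha):
    """Approximate power of a two-proportion z-test on allele counts.

    The opposite rejection tail is ignored, which is a common simplification.
    """
    if not 0.0 < R * q < 1.0:
        raise ValueError("R * q must be a valid allele frequency")
    n_x = case_count(x, n1, eps)
    s = case_allele_freq(x, R, q, delta)
    N = 4.0 * n_x * m / (n_x + m)            # as defined in the text
    p_bar = (n_x * s + m * q) / (n_x + m)    # pooled A2 frequency (assumption)
    z_stat = abs(s - q) * math.sqrt(N / (2.0 * p_bar * (1.0 - p_bar)))
    z_crit = _STD_NORMAL.inv_cdf(1.0 - alpha / 2.0)
    return _STD_NORMAL.cdf(z_stat - z_crit)


# Illustrative call; alpha = 5e-8 is an assumed threshold, not from the text.
print(power(x=7, n1=400, m=10_000, R=2.0, q=0.20,
            delta=0.15, eps=0.15, alpha=5e-8))
```

Because δ and ε enter only through `misclassification` and `case_count`, other functional forms estimated for a given ICD code can be substituted without changing the rest of the calculation.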

Results and Discussion
The above model is used to explore the impact of ICD code definitions on power. To obtain the value of x that maximizes the power to detect genetic association, one can numerically solve, for x, the differential equation that sets the derivative of the power with respect to x equal to zero (5). The solution to (5) can be obtained through standard numerical methods. The integer closest to the value of x that solves this continuous equation can be used to optimize the power for a given set of parameters. To exemplify the use of this approach, let m = 10,000, $n_{x=1}$ = 400, R = 2, q = 0.20, δ = 0.15, and ε = 0.15. Call this set of parameters the baseline model. The value x = 7.2265 solves the differential equation. Therefore, using seven instances of an ICD code will yield the optimal design, weighing the trade-off between the case sample size and the misclassification. Figure 1 shows the power curve for this set of parameters.
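The optimization step can be sketched numerically in Python. The power function below uses assumed model forms (exponential f(x) and n_x, a mixture for s, and a z-test approximation), so its optimum need not coincide exactly with the value reported in the text. The defaults are the baseline parameters from the text, except α = 5e-8, which is an assumed threshold. Golden-section search is used here as one standard derivative-free method; it is a substitute for solving the stationarity equation directly, and all names are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(x, n1=400.0, m=10_000.0, R=2.0, q=0.20,
          delta=0.15, eps=0.15, alpha=5e-8):
    """Approximate power at threshold x under ASSUMED model forms."""
    f = math.exp(-delta * x)                 # ASSUMED misclassification f(x)
    n_x = n1 * math.exp(-eps * (x - 1.0))    # ASSUMED case count n_x
    s = (1.0 - f) * R * q + f * q            # ASSUMED A2 frequency in cases
    N = 4.0 * n_x * m / (n_x + m)            # as defined in the text
    p_bar = (n_x * s + m * q) / (n_x + m)    # pooled A2 frequency (assumption)
    z_stat = abs(s - q) * math.sqrt(N / (2.0 * p_bar * (1.0 - p_bar)))
    return STD_NORMAL.cdf(z_stat - STD_NORMAL.inv_cdf(1.0 - alpha / 2.0))


def golden_section_max(func, lo, hi, tol=1e-5):
    """Locate the maximizer of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    fc, fd = func(c), func(d)
    while b - a > tol:
        if fc > fd:
            # The maximum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - inv_phi * (b - a)
            fc = func(c)
        else:
            # The maximum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + inv_phi * (b - a)
            fd = func(d)
    return (a + b) / 2.0


x_cont = golden_section_max(power, 1.0, 20.0)   # continuous optimum
x_int = max(range(1, 21), key=power)            # best integer threshold
print(f"continuous optimum x = {x_cont:.4f}, best integer x = {x_int}")
```

Evaluating the integer thresholds directly, in addition to rounding the continuous optimum, guards against cases where the power curve is asymmetric around its peak and the nearest integer is not the best one.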
To investigate how the power curves vary with the baseline number of cases ($n_{x=1}$), the calculations were performed as $n_{x=1}$ varied from 100 to 800. Figure 2 shows the results. Visual inspection shows that power peaks at approximately 7 instances.
Next, to determine the effect of the δ and ε parameters on the power curves, the calculations were performed while fixing the other parameters. Figures 3 and 4 display these results.

Conclusions
Genetic data linked to longitudinal electronic health records can serve as a very useful tool in modern disease genetics. However, misclassification present in ICD coding systems can severely hamper large-scale screens using those codes for the purpose of genetic association studies. This work has described a simple approach to better understand the impact of misclassification present in EHR systems for the purpose of optimizing experimental designs that screen numerous ICD codes in genetic association studies. Under the mathematical models considered, the methods offer an approach to select the number of instances of an ICD code for the purpose of defining cases and obtaining an optimal experimental design for the identification of genetic markers. Additional work is needed in this area to improve disease

Disclosure
The content is solely the responsibility of the author and does not necessarily represent the official views of the National Institutes of Health.

Conflicts of Interest
The author declares that he has no conflicts of interest.

To explore the effect of the δ parameter on the power calculations, the baseline model was modified to include values of δ from 0.01 to 0.30. The power to detect genetic association was calculated across these δ values.