An Investigation of the Significance of Residual Confounding Effect

Background. Observational studies are commonly conducted in health research. However, due to their lack of randomization, the estimated associations between the outcome and the exposure can be affected by unmeasured confounding factors. It is important to determine how likely a significant association observed between an outcome variable and a noncausally related exposure may be introduced by residual confounding factors. Methods. A simulation approach is developed based on the sufficient cause model to test the likelihood of significant associations observed between a noncausally related exposure and the outcome. Results. Based on the estimates from all 500 replicates, the association between the exposure and the outcome is found to be significant in 386 (77%) replicates when all confounders (component causes) are controlled for in the model. However, when a subset of real component causes and some noncausal factors are controlled for in the model, the association between exposure and the outcome becomes significant in 487 (97%) replicates. Conclusion. Even when all confounding factors are known and controlled for using conventional multivariate analysis, the observed association between exposure and outcome can still be dominated by residual confounding effects. Therefore, an observed significant association apparently provides limited evidence for a causal relationship.


Introduction
Ethical and budgetary constraints often limit the application of experimental study designs in health research, so observational studies such as cohort or case-control studies have been widely undertaken as methodological alternatives [1-5]. However, due to the lack of randomization, the estimates so obtained can be influenced by uncontrolled or unmeasured confounders, which typically bias estimates away from their true values [6-12]. According to the epidemiological literature, a confounder must meet the following conditions: (i) it is a cause of the disease, or a proxy for a cause, in unexposed people; (ii) it is correlated with the exposure in the study population; and (iii) it is not an intermediate step in the causal pathway between the exposure and the disease [1, 13-16]. To deal with confounding effects, known or suspected confounders are measured together with the exposure and outcome of interest. Multivariate analyses are then performed to measure the association between the exposure and the outcome while attempting to remove the effects of such known or suspected confounders [8, 13, 17-19].
Under the sufficient cause model, a sufficient cause is a complete causal mechanism, defined as a combination of minimal conditions (necessary elements) and events that inevitably produces disease; the necessary elements that constitute a sufficient cause are its component causes [2]. Component causes and the compositions of sufficient causes are commonly unknown, and measurement errors and misclassifications of exposures, confounders, and outcomes exist at the same time [8, 20-23]. Consequently, the estimated associations between the outcome and the exposure remain likely to be affected by unmeasured confounding factors. For example, even in well-designed studies, significant protective associations observed between truly nonprotective exposures and outcomes have been attributed to unmeasured confounding factors [24, 25]. It is thus important to investigate how likely it is that a significant association observed between an outcome variable and a noncausally related exposure is introduced by residual confounding factors. In this study, we develop a simulation approach to test the likelihood of observing significant associations between a noncausally related exposure and the outcome variable based on standard multivariate analysis, given that the compositions of sufficient causes are not recognized, but either all risk factors/component causes are known and controlled, or only some of the risk factors/component causes are known and controlled. There are two objectives: (1) to investigate the likelihood of false positive observations in observational studies; and (2) to propose a simulation framework for assessing epidemiologic methods that deal with confounding effects.
BioMed Research International

2.1. Overview of the Simulation. The simulation process follows the sufficient cause model [2]. For an event to occur, at least one sufficient cause has to occur. The components of a sufficient cause are randomly chosen from a pool of variables with low to moderate correlations, which includes the exposure of interest and 99 other variables. The exposure of interest is set to be noncausal for the outcome and therefore it will never be chosen as a component of a sufficient cause. Given the correlation among the 100 variables, every chosen variable is a potential confounding factor for the association between the exposure and the outcome. The association between the exposure and the outcome is then estimated using a logistic regression model, while controlling for (i) all component causes; and (ii) some of the component causes (selected at random). The simulations contain 500 replicates, with each replicate being generated through an independent process. All simulations are performed using Stata release 12. The procedures involved in each replicate are outlined below. Details of the simulation procedure, including the sufficient cause model and the estimation process, are provided in the Appendix.
(2) Determine the composition of sufficient causes and the threshold values of components. The total number of types of sufficient causes for the outcome $Y$ is randomly chosen from $(1, 2, 3, \ldots, 9)$. Components for each type of sufficient cause are randomly selected from the variables $X_i$, $i = (2, 3, \ldots, 100)$. $X_1$ is taken as the exposure, which is set to be noncausal for $Y$. For each observation, a sufficient cause is set to occur when each of its components has a value higher than its specific threshold value. The threshold value is specific to each component as well as to each type of sufficient cause, and it is randomly chosen from a uniform [0.5, 0.9) distribution. This allows the threshold values to vary between components as well as between different sufficient causes for the same component. To reflect the fact that exact threshold values are typically unknown, the values $X_{i,j}$ are then dichotomized into binary form, denoted by $Z_{i,j}$, $i = (1, 2, 3, \ldots, 100)$, $j = (1, 2, 3, \ldots, 50000)$, by applying the following rule: $Z_{i,j}$ is set to 1 if $X_{i,j} > 0.7$, and 0 otherwise. Here, the mean 0.7 of a uniform [0.5, 0.9) variable is used instead of the exact threshold values, in order to account for unavoidable measurement errors and misclassifications in confounders and exposures.
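Dichotomizing at 0.7 rather than at the exact threshold can leave residual confounding even for a variable that is nominally controlled. The following Python sketch is a deliberately simplified toy, not the authors' Stata procedure. It assumes a single causal variable `c` with a true threshold of 0.8, an exposure `x` built as an equal-weight mix of `c` and independent noise, and a stratified risk ratio in place of logistic regression. All names and parameter values are illustrative assumptions.

```python
import random

def simulate(n=50_000, true_threshold=0.8, cut=0.7, seed=1):
    """Toy replicate: one causal variable c and one noncausal exposure x."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        c = rng.random()                      # causal variable, uniform on [0, 1)
        x = 0.5 * c + 0.5 * rng.random()      # exposure, correlated with c only
        y = 1 if c > true_threshold else 0    # outcome depends on c alone
        # Exposure and confounder are both recorded after dichotomizing at `cut`.
        rows.append((int(x > cut), int(c > cut), y))
    return rows

def risk_ratio(rows):
    """Exposed-to-unexposed risk ratio over (exposed, confounder, outcome) rows."""
    exposed = [y for e, _, y in rows if e == 1]
    unexposed = [y for e, _, y in rows if e == 0]
    return (sum(exposed) / len(exposed)) / (sum(unexposed) / len(unexposed))

rows = simulate()
crude_rr = risk_ratio(rows)
# Restrict to the stratum where the dichotomized confounder equals 1.
stratum_rr = risk_ratio([r for r in rows if r[1] == 1])
print(f"crude RR = {crude_rr:.2f}, RR within confounder stratum = {stratum_rr:.2f}")
```

The stratum `c > 0.7` still mixes observations above and below the true threshold, and exposed observations tend to have larger `c`. The within-stratum ratio therefore stays above 1 in this toy setup although `x` has no effect on `y`.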
(4) Generate small random errors $E_j$ for the outcome $Y_j$ to represent measurement errors of the outcome and to smooth the computing process. $E_j$ is a Bernoulli distributed random variable that is independent of the variables $X_{i,j}$ and of the competing events $M_j$, and it accounts for only a small proportion of the variance of $Y_j$. Details of steps 1 to 6 can be found in the Appendix.
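(7) Estimate the effect of the dichotomized exposure $Z_1$ on the outcome $Y$ when all component causes are identified and no noncausal factor is mistaken for a causal factor. We have

$$\operatorname{logit} P(Y = 1) = \beta_0 + \beta_1 Z_1 + \sum_{i=2}^{100} \beta_i I_i Z_i,$$

where $\beta_0$ is the intercept, $Z_i$ is the dichotomized form of the variable $X_i$, and $I_i$ indicates whether $X_i$ is involved in at least one sufficient cause of $Y$, that is, $I_i = 1$ if true and $I_i = 0$ otherwise. Here, $\beta_1$ and $\beta_i$ are the estimated effects of $Z_1$ and each of the component causes on $Y$, respectively. To estimate the effect of $Z_1$ on $Y$ when only some component causes are known and some noncausal factors are mistaken for causal factors, we have

$$\operatorname{logit} P(Y = 1) = \beta_0 + \beta_1 Z_1 + \sum_{i=2}^{100} \beta_i K_i Z_i,$$

where $K_i$ indicates whether $X_i$ is "known" or suspected to be involved in at least one sufficient cause of $Y$, and $\beta_1$ and $\beta_i$ are the estimated effects of $Z_1$ and each of the "known" risk factors on $Y$, respectively.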

Results
Data obtained from replicate 1 are used as an example. Table 1 shows details of the sufficient causes and their components for replicate 1. Overall, the incidence rate (per 1000 observation units) of the outcome $Y$ is 32.4, while it is 20.2 among unexposed observations (exposure indicator $Z_1 = 0$) and 89.0 among exposed observations ($Z_1 = 1$). This leads to an observed crude exposed-to-unexposed risk ratio of 4.4, even though the exposure is not causal for $Y$. Moreover, as shown in Table 2, the associations between the exposure and the confounders are weak, and the level of misclassification of confounder status is low. As described in the simulation design and the Appendix, the total number of sufficient causes and the components of each possible sufficient cause vary between replicates and are determined by an independent random process (e.g., sufficient cause A has two components, $X_{17}$ and $X_{50}$; sufficient cause B has three components, $X_7$, $X_{29}$, and $X_{53}$). When all confounding factors (component causes) are controlled for in the model, the effect of the exposure remains significant ($P < 0.001$). Table 3 suggests that the effect of the exposure is further biased away from the null when only a subset of real component causes and some noncausal factors are controlled for in the model. Based on the estimates from all replicates, the association between the exposure and the outcome is found to be significant in 386 (77%) of the 500 replicates when all confounders (component causes) are controlled for in the model. However, when a subset (rather than all) of the real component causes and some noncausal factors are controlled for in the model, the association between the exposure and the outcome becomes significant in 487 (97%) of the 500 replicates.
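The summary figures in this paragraph can be checked with a few lines of arithmetic. The Python snippet below reuses only numbers stated in the text; the variable names are illustrative.

```python
# Figures quoted in the Results for replicate 1 and for all 500 replicates.
incidence_exposed = 89.0     # per 1000 observation units
incidence_unexposed = 20.2   # per 1000 observation units

crude_risk_ratio = incidence_exposed / incidence_unexposed
share_all_controlled = 386 / 500      # all component causes controlled
share_subset_controlled = 487 / 500   # subset plus noncausal factors controlled

print(f"crude risk ratio: {crude_risk_ratio:.1f}")                        # 4.4
print(f"significant, all controlled: {share_all_controlled:.0%}")         # 77%
print(f"significant, subset controlled: {share_subset_controlled:.0%}")   # 97%
```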
In addition, Figure 1 indicates that when adjusting for all the real component causes, the estimated effect of the exposure, where significant, is on average substantially smaller than the effects of the real component causes. The mean (standard deviation), 25th, 50th, and 75th percentiles of the significant coefficients (natural logarithm of the odds ratio) are 0.22 (0.17), 0.14, 0.18, and 0.25, respectively, for the noncausal exposure and 0.73 (0.79), 0.23, 0.42, and 0.927, respectively, for the real component causes.
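Because the coefficients are natural logarithms of odds ratios, exponentiating them puts them on the odds-ratio scale. The Python sketch below does this for the reported means and medians. The dictionary layout is an illustrative choice, and exponentiating a mean log odds ratio gives a geometric mean, not the arithmetic mean of the odds ratios.

```python
import math

# (mean, median) of the significant coefficients reported in the text, log-odds scale.
log_odds = {
    "noncausal exposure": (0.22, 0.18),
    "real component causes": (0.73, 0.42),
}

# exp() maps a log odds ratio to an odds ratio.
odds_ratios = {
    name: tuple(round(math.exp(b), 2) for b in coefs)
    for name, coefs in log_odds.items()
}

for name, (from_mean, from_median) in odds_ratios.items():
    print(f"{name}: OR from mean = {from_mean}, OR from median = {from_median}")
```

On this scale, the exposure's significant estimates correspond to odds ratios of roughly 1.2, against roughly 1.5 to 2.1 for the real component causes.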

Discussion
In observational studies, when a statistically significant association arises between an exposure and the outcome in a multivariate analysis, it is usually considered supportive evidence for a causal relationship [8]. We adopt the sufficient cause model in the simulation process to investigate how likely it is that a significant association between the exposure and the outcome is observed when there is no causal association between the two in an observational study setting. The results indicate that significant associations between the exposure and its noncausally related outcome are present in more than 70% of the situations, even when all confounders (causal factors) are assumed to be known to researchers and controlled for in the multivariate analysis. In reality, many component causes of a disease are unknown [8, 20-23].
Moreover, results from the simulation study suggest that under the conventional multivariate analysis approach, residual confounding effects remain strong enough to influence the observed associations, and an observed significant association provides only limited evidence for a causal relationship. Therefore, new methods are required to handle residual confounding effects. The simulation design adopted in this study can also serve as a platform to evaluate the performance of such methods. Our simulation design has several advantages. First, although all component causes and sufficient causes are determined through a random process, they are all tracked and measured, unlike in collected data, where most information on component causes and sufficient causes is unknown and unmeasurable. Second, for specific exposures and outcomes, information from the existing literature can be easily incorporated into the simulation design. Third, the simulation design can be adjusted to fit specific prior assumptions on the distributions of and correlations among component causes and the exposure, as well as on the compositions of sufficient causes. Hence, it is possible to obtain estimates of the effects of the exposure under different prior assumptions.

Conclusion
This study demonstrates that even when all confounding factors are known and controlled for using conventional multivariate analysis, the observed association between exposure and outcome can still be dominated by residual confounding effects. An observed significant association apparently provides limited evidence for a causal relationship.

Details of Steps 1 to 6 in Simulation Procedure
(2) Determine sufficient cause compositions and their components.
(i) Components for nine possible sufficient causes for the outcome $Y$ are determined. Let $C_{k,i}$, $k = (1, 2, 3, \ldots, 9)$, $i = (1, 2, 3, \ldots, 100)$, indicate whether the variable $X_i$ is a component of the $k$th possible sufficient cause: $C_{k,i} = 1$ if $X_i$ is a component of the $k$th possible sufficient cause, and 0 otherwise. $C_{k,i}$ takes on a random value drawn from the Bernoulli distribution with probability of success $p$, which is derived (rescaled) from a gamma distribution with both shape parameter and scale parameter equal to 1. For each sufficient cause, if there are fewer than 2 components, that is, for a given $k$, if $\sum_i C_{k,i} < 2$, then all components are redetermined through the same random process.
(ii) Determine whether a possible sufficient cause occurs. Let $A_{k,j} = 1$ when all components of the $k$th possible sufficient cause become active or occur in the $j$th observation, that is, when $\sum_i C_{k,i} T_{k,i,j} = \sum_i C_{k,i}$, where $T_{k,i,j}$ indicates whether $X_i$ exceeds its threshold value for the $k$th possible sufficient cause in the $j$th observation; otherwise $A_{k,j} = 0$. Here $i = (2, 3, \ldots, 100)$, $k = (1, 2, \ldots, 9)$, and $j = (1, 2, 3, \ldots, 50000)$.
(iii) Choose real sufficient causes from the nine possible sufficient causes. Let $S_k$, $k = (1, 2, \ldots, 9)$, denote whether the $k$th possible sufficient cause is a real sufficient cause for $Y$. If the $k$th possible sufficient cause is a real sufficient cause, then $S_k = 1$, and 0 otherwise. $S_k$ takes on a random value drawn from the Bernoulli distribution with probability of success 0.5. If no real sufficient cause is assigned, that is, $\sum_k S_k < 1$, then the real sufficient causes for $Y$ are redetermined through the same random process.
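Sub-steps (i) to (iii) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' Stata code. The text does not specify how the gamma draw is rescaled, so `min(g, 5) / 50` is a hypothetical placeholder. Independent uniform values stand in for the correlated variables of step 1, and competing events and random errors are ignored. All function and variable names are illustrative.

```python
import random

N_VARS, N_CAUSES = 100, 9   # 100 variables, nine possible sufficient causes
rng = random.Random(2024)

def draw_components():
    """Sub-step (i): component indicators for one possible sufficient cause."""
    while True:
        # Hypothetical rescaling of a gamma(shape=1, scale=1) draw into a probability.
        p = min(rng.gammavariate(1.0, 1.0), 5.0) / 50.0
        # Index 0 stands for the exposure, which is never a component.
        comps = [0] + [int(rng.random() < p) for _ in range(N_VARS - 1)]
        if sum(comps) >= 2:          # redraw when fewer than two components
            return comps

def draw_real_causes():
    """Sub-step (iii): choose real sufficient causes, at least one."""
    while True:
        real = [int(rng.random() < 0.5) for _ in range(N_CAUSES)]
        if sum(real) >= 1:
            return real

components = [draw_components() for _ in range(N_CAUSES)]
# One threshold per (cause, variable) pair, uniform on [0.5, 0.9).
thresholds = [[0.5 + 0.4 * rng.random() for _ in range(N_VARS)]
              for _ in range(N_CAUSES)]
real = draw_real_causes()

def cause_occurs(k, x_row):
    """Sub-step (ii): cause k occurs when all of its components exceed their thresholds."""
    active = sum(c * int(x > t)
                 for c, x, t in zip(components[k], x_row, thresholds[k]))
    return int(active == sum(components[k]))

# Placeholder observation: independent uniform values instead of correlated ones.
x_row = [rng.random() for _ in range(N_VARS)]
occurred = [cause_occurs(k, x_row) for k in range(N_CAUSES)]
# The outcome is triggered when at least one real sufficient cause occurs.
outcome_from_causes = int(any(o and r for o, r in zip(occurred, real)))
```

Comparing the count of active components with the total count of components mirrors the equality of the two sums used to define occurrence in sub-step (ii).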
(3) Determine competing events. Let $M_j$ denote the competing events for outcome $Y_j$, $j = (1, 2, 3, \ldots, 50000)$. $M_j$ is a Bernoulli distributed random variable with a probability of success 0.001, value of success (competing events occurred) being 1, and value of failure (competing events not occurred) being 0. $M_j$ is independent of the variables $X_{i,j}$.
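The random error $E_j$ generated in step (4) is a Bernoulli distributed random variable with a probability of success 0.001, value of success being 1, and value of failure being 0. $E_j$ is independent of both the variables $X_{i,j}$ and the competing events $M_j$.
Let $K_i$, $i = (2, 3, 4, \ldots, 100)$, denote the researcher's knowledge (not necessarily the fact) on $X_i$ in relation to its confounding effect on the association between the exposure $X_1$ and the outcome $Y$. $K_i$ is a random value drawn from the Bernoulli distribution with a probability of success $0.1 + \sum_k S_k C_{k,i} / 10$, value of success being 1, and value of failure being 0. Here $S_k$ indicates whether the $k$th possible sufficient cause is a real sufficient cause, $C_{k,i}$ indicates whether $X_i$ is one of its components, and $\sum_k S_k C_{k,i}$ is the total number of real sufficient causes that include $X_i$ as a component.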
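The rule for the researcher's knowledge can be illustrated with a short Python sketch. It follows the stated probability of 0.1 plus one tenth of the number of real sufficient causes that contain the variable. The function names and the example counts are hypothetical.

```python
import random

def knowledge_probability(n_real_causes_with_variable):
    """Probability that a variable is 'known' or suspected to be causal."""
    return 0.1 + n_real_causes_with_variable / 10

def draw_knowledge(component_counts, rng):
    """Draw one Bernoulli knowledge indicator per candidate variable."""
    return [int(rng.random() < knowledge_probability(n)) for n in component_counts]

rng = random.Random(7)
# Hypothetical counts for five candidate variables; 0 marks a noncausal factor.
component_counts = [0, 0, 1, 3, 9]
known = draw_knowledge(component_counts, rng)

for n, k in zip(component_counts, known):
    print(f"in {n} real cause(s): P(known) = {knowledge_probability(n):.1f}, drawn = {k}")
```

Under this rule a noncausal factor still has a 10% chance of being treated as a confounder, while a variable that belongs to all nine possible sufficient causes, all of them real, is always known.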