A Proxy Outcome Approach for Causal Effect in Observational Studies: A Simulation Study

Background. Known and unknown/unmeasured risk factors are the main sources of confounding in observational studies and can lead to false observations of elevated protective or hazardous effects. In this study, we investigate an alternative approach of analysis that relies on field-specific knowledge rather than purely statistical assumptions. Method. The proposed approach introduces a proxy outcome into the estimation system. A proxy outcome possesses the following characteristics: (i) the exposure of interest is not a cause of the proxy outcome; (ii) the causes of the proxy outcome and of the study outcome are subsets of a collection of correlated variables. Based on these two conditions, the confounding-driven association between the exposure and the proxy outcome can be measured and used as a proxy estimate for the effects of unknown/unmeasured confounders on the outcome of interest. The performance of this approach is tested in a simulation study in which 500 different scenarios are generated, with the causal factors of the proxy outcome and the study outcome partly overlapping under low-to-moderate correlations. Results. The simulation results demonstrate that the conventional approach led to a correct conclusion in only 21% of the 500 scenarios, compared with 72.2% for the alternative approach. Conclusion. The proposed method can be applied in observational studies in social science and health research that evaluate the health impact of behaviour and mental health problems.


Background
Owing to the lack of randomization, estimates obtained from observational studies are often affected by uncontrolled or unmeasured confounding. Several methods have been proposed to deal with the problem [1-11], but their application relies on assumptions about the distribution of the unknown confounding factor(s) in relation to the outcome, the exposure, and other known covariates.
The present study investigates an alternative approach of analysis that makes use of field-specific knowledge to determine a proxy outcome, which is then employed to estimate uncontrolled confounding effects. A proxy outcome should satisfy the following conditions: (i) the exposure of interest is not a cause of the proxy outcome; and (ii) the causes of the proxy outcome and of the main outcome are subsets of a collection of correlated variables. If condition (i) is satisfied, then the observed association between the exposure and the proxy outcome is driven entirely by confounding effects. Condition (i) may nevertheless be relaxed to some extent, for example, when it is certain that the confounding effect is far stronger than any possible causal effect. When condition (ii) is satisfied, the confounders for the proxy outcome and for the outcome of interest are similar or at least correlated. For example, various physical and mental health outcomes can be affected by a cluster of socioeconomic, behavioural, psychological, and genetic factors [12-22]. Researchers can apply their field knowledge and experience to determine the best proxy outcome for their outcome and exposure of interest.
Intuitively, let $Y$ and $\tilde{Y}$ be the main outcome and the proxy outcome, respectively; $R$ denotes the possible risk of $Y$ due to a given level of exposure; $N_0$ and $N_1$ are the unexposed and exposed person-time at risk; $a_0$ and $a_1$ represent the number of cases of $Y$ observed in $N_0$ and $N_1$, whereas $\tilde{a}_0$ and $\tilde{a}_1$ are the number of cases of $\tilde{Y}$ observed in $N_0$ and $N_1$, respectively.

BioMed Research International
Further, suppose that $C(N_0)$ and $C(N_1)$ define the expected probability that one or more sufficient causes not involving the exposure of interest occur for $Y$ in a unit of person-time at risk within $N_0$ and $N_1$, respectively; correspondingly, $\tilde{C}(N_0)$ and $\tilde{C}(N_1)$ define the expected probability of one or more sufficient causes occurring for $\tilde{Y}$ in a unit of person-time at risk within $N_0$ and $N_1$, respectively. Then, the observed crude risk ratios are
$$RR = \frac{a_1/N_1}{a_0/N_0}, \qquad \widetilde{RR} = \frac{\tilde{a}_1/N_1}{\tilde{a}_0/N_0}.$$
If the sufficient causes of $Y$ and $\tilde{Y}$ are the same, largely overlapping, or strongly correlated, then $RR \approx \widetilde{RR}$ will be observed if the exposure is not causal for $Y$ ($R = 0$), whereas $RR > \widetilde{RR}$ will be observed if the exposure is causal for $Y$ ($R > 0$). Given this strict assumption, the method has been successfully applied in time series analysis recently, whereby the proxy outcome was described as a control series [23,24]. For example, the study by Herttua and colleagues investigated the effect of alcohol price (exposure) on alcohol-related mortality (outcome of interest), while coronary operations were used as the control series (proxy outcome) [23]. Nevertheless, such a method may remain valid when the assumptions are relaxed and can be applied to different study designs. Because the details of the underlying causal mechanisms for an outcome (i.e., all sufficient causes) are typically unknown, it is best to ascertain the validity of the method through simulations. Therefore, we conduct a simulation study to test its application, focusing on situations in which the causes of $Y$ and $\tilde{Y}$ only partly overlap and have only low to moderate strength of association.
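The reasoning behind this comparison can be sketched as follows. This is an illustrative derivation rather than the authors' own. The symbols are assumed labels: $a_0$, $a_1$ ($\tilde{a}_0$, $\tilde{a}_1$) are the case counts of the main (proxy) outcome, $N_0$ and $N_1$ the unexposed and exposed person-time, $C(\cdot)$ and $\tilde{C}(\cdot)$ the background probabilities of the two outcomes, $R$ the risk due to exposure, and $RR$ and $\widetilde{RR}$ the crude risk ratios for the main and proxy outcomes. Counts are replaced by their expectations, and the exposure is assumed to act independently of the other sufficient causes.

```latex
\begin{align*}
\frac{E[a_0]}{N_0} &= C(N_0), &
\frac{E[a_1]}{N_1} &= C(N_1) + R\,\bigl(1 - C(N_1)\bigr), \\
\frac{E[\tilde{a}_0]}{N_0} &= \tilde{C}(N_0), &
\frac{E[\tilde{a}_1]}{N_1} &= \tilde{C}(N_1), \\
RR &\approx \frac{C(N_1) + R\,\bigl(1 - C(N_1)\bigr)}{C(N_0)}, &
\widetilde{RR} &\approx \frac{\tilde{C}(N_1)}{\tilde{C}(N_0)} .
\end{align*}
```

If the background ratios agree, that is, $C(N_1)/C(N_0) = \tilde{C}(N_1)/\tilde{C}(N_0)$, then $R = 0$ gives $RR \approx \widetilde{RR}$, while $R > 0$ (with $C(N_1) < 1$) gives $RR - \widetilde{RR} \approx R\,(1 - C(N_1))/C(N_0) > 0$.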

Simulation Design.
The simulation process follows the sufficient cause model [25]. In the simulation, for an event to occur, at least one sufficient cause has to occur, which comprises the occurrence of two matched causal components and the absence of any competing event. In addition, a small randomly distributed error term is introduced to ensure that perfect prediction (which interrupts the computing process) does not occur. To account for the fact that only some real causal factors are known, while some noncausal factors are mistaken for causal factors, a collection of variables is included to encompass the exposure, causal factors, and noncausal factors, and a subset of this pool provides the known variables. All simulations are performed in Stata release 12.
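The event rule described above can be sketched in a few lines. This is a minimal Python illustration, not the Stata code used in the study: the function name `outcome_occurs`, the representation of sufficient causes as index pairs, the per-person competing-event flag, and the flip-based error mechanism with `error_rate` are all assumptions made for the example.

```python
import random


def outcome_occurs(x, sufficient_causes, competing_event, rng, error_rate=0.01):
    """Return 1 if the outcome occurs for one person, else 0.

    x                 -- dict mapping variable index -> 0/1 value
    sufficient_causes -- list of (i, j) pairs of matched causal components
    competing_event   -- True if a competing event is present for this person
    error_rate        -- assumed probability of flipping the result (error term)
    """
    # A sufficient cause is complete when both matched components are present.
    any_cause = any(x[i] == 1 and x[j] == 1 for i, j in sufficient_causes)
    event = any_cause and not competing_event
    # Assumed error mechanism: flip the outcome with a small probability so
    # that no set of covariates predicts the outcome perfectly.
    if rng.random() < error_rate:
        event = not event
    return int(event)


rng = random.Random(0)
person = {1: 1, 2: 1, 3: 0, 4: 1}
# Components 1 and 2 are both present, no competing event, error switched off.
print(outcome_occurs(person, [(1, 2), (3, 4)], False, rng, error_rate=0.0))  # prints 1
```

Setting `error_rate=0.0` makes the rule deterministic, which is convenient for checking the logic before adding noise.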
(6) Determine the known/suspected (not necessarily factual) "causal" factors (other than $X_1$) for outcomes $A$ and $B$. The known/suspected "causal" factors for outcomes $A$, $B$, and $C$ are determined for $X_i$, for $i = (1, 2, 3, \ldots, 40)$. Let $K_{j,i}$ denote the researcher's knowledge (not necessarily the fact) of causes for outcomes $A$ and $B$. Let $j = (A, B)$, where $K_{j,i} = 1$ indicates a known "causal" factor. Because $X_1$ is the exposure of interest, we force each $K_{j,i} = 1$ when $i = 1$. For $i = (2, 3, \ldots, 20)$, $K_{j,i}$ is a random value drawn from a Bernoulli distribution with a probability of success of 0.5. For $i = (21, 22, 23, \ldots, 40)$, $K_{j,i}$ is a random value drawn from a Bernoulli distribution with a probability of success of 0.15, value of success = 1, and value of failure = 0. The difference in the success rates between the two groups reflects that a real causal factor is more likely to be acknowledged than a noncausal factor.
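The knowledge draw in step (6) can be sketched as below. This is an illustrative Python version, not the original Stata code: the function name `draw_knowledge` and its parameter names are hypothetical, while the index ranges (1, 2 to 20, 21 to 40) and the probabilities (0.5 and 0.15) are taken from the text.

```python
import random


def draw_knowledge(rng, n_vars=40, last_high_index=20, p_high=0.5, p_low=0.15):
    """Return a dict mapping variable index -> 1 (known/suspected cause) or 0."""
    knowledge = {}
    for i in range(1, n_vars + 1):
        if i == 1:
            # The exposure of interest is always included in the model.
            knowledge[i] = 1
        elif i <= last_high_index:
            # Variables 2..20: acknowledged with probability 0.5.
            knowledge[i] = int(rng.random() < p_high)
        else:
            # Variables 21..40: acknowledged with probability 0.15.
            knowledge[i] = int(rng.random() < p_low)
    return knowledge


rng = random.Random(2014)
k_a = draw_knowledge(rng)  # knowledge indicators for one outcome
k_b = draw_knowledge(rng)  # drawn independently for a second outcome
print(sum(k_a.values()), sum(k_b.values()))
```

Drawing a separate indicator set per outcome mirrors the idea that what is known or suspected can differ between outcomes.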
The true effects of $X_1$ on outcomes $A$ and $B$ are estimated by a logistic regression based on the fact model, where $\beta_{j,i}$ is the estimated real effect of $X_i$ on outcome $j$ for $j = (A, B)$. To estimate the effects of $X_1$ on outcomes $A$ and $B$ based on known/suspected confounders, standard multivariate logistic regression is applied as the adjustment method, where $b_{j,i}$ is the estimated effect of $X_i$ on outcome $j$ for $j = (A, B)$. To estimate the confounding effects on $X_1$ for outcomes $A$ and $B$ using the proxy outcome $C$, the same logistic model is fitted with $C$ as the dependent variable, where $\tilde{b}_{j,i}$ is the estimated effect of $X_i$ on outcome $C$ (proxy outcome) for $j = (A, B)$. The alternative estimate of the effect of $X_1$ on outcome $j$ is then $\hat{b}_{j,1} = b_{j,1} - \tilde{b}_{j,1}$.
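The idea of subtracting the proxy-outcome coefficient from the conventional coefficient can be illustrated with a deliberately simplified sketch. It assumes a single unmeasured binary confounder and no known covariates, so each logistic coefficient reduces to the log odds ratio of a 2×2 table. All probabilities, the function names, and the data-generating process are illustrative assumptions and do not reproduce the paper's simulation design; in particular, the outcome and the proxy here share the confounder exactly, whereas the paper studies partial overlap.

```python
import math
import random


def log_odds_ratio(exposure, outcome):
    """Log odds ratio of a 2x2 table; this equals the logistic regression
    coefficient when the exposure is the only (binary) covariate."""
    n11 = n10 = n01 = n00 = 0
    for x, y in zip(exposure, outcome):
        if x and y:
            n11 += 1
        elif x:
            n10 += 1
        elif y:
            n01 += 1
        else:
            n00 += 1
    return math.log((n11 * n00) / (n10 * n01))


def simulate(true_effect, n=200_000, seed=1):
    """Return (conventional, proxy, alternative) log odds ratios."""
    rng = random.Random(seed)
    xs, ys, zs = [], [], []
    for _ in range(n):
        u = rng.random() < 0.5                               # unmeasured confounder
        x = rng.random() < 0.2 + 0.5 * u                     # exposure depends on u
        y = rng.random() < 0.1 + 0.3 * u + true_effect * x   # outcome of interest
        z = rng.random() < 0.1 + 0.3 * u                     # proxy: no exposure effect
        xs.append(x)
        ys.append(y)
        zs.append(z)
    b_conventional = log_odds_ratio(xs, ys)
    b_proxy = log_odds_ratio(xs, zs)
    return b_conventional, b_proxy, b_conventional - b_proxy


# Exposure with no causal effect: the conventional estimate is confounded,
# while the difference from the proxy estimate should be close to zero.
print(simulate(true_effect=0.0))
# Exposure with a causal effect: the difference should stay clearly positive.
print(simulate(true_effect=0.2))
```

With covariates in the model the coefficients would come from a fitted logistic regression rather than a 2×2 table, but the subtraction step is the same.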

Classification of Effect of $X_1$ on $A$ and $B$.
Based on the fact model, $X_1$ increases the risk of outcome $A$ if $\beta_{A,1} > 0.05$ and the P value for $\beta_{A,1}$ is below 0.05; otherwise $X_1$ has no effect on $A$. Likewise, $X_1$ increases the risk of outcome $B$ if $\beta_{B,1} > 0.05$ and the P value for $\beta_{B,1}$ is below 0.05; otherwise $X_1$ has no effect on $B$. The same rules apply analogously to the conventional model and the alternative approach, replacing the regression coefficient $\beta_{j,1}$ with $b_{j,1}$ and $\hat{b}_{j,1}$, respectively. Classifications of the effect of $X_1$ based on the fact model are then used as the gold standard to compare with the classifications based on the conventional approach and the alternative approach.
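The classification rule can be written as a small helper. This is a sketch under stated assumptions: the function names `is_causal` and `alternative_estimate` are hypothetical, and treating the P value as optional (so that the alternative estimate is judged on the coefficient cutoff alone when no P value is supplied) is an assumption, since the text does not say how a P value is obtained for the difference of two coefficients. The numbers in the usage lines are those of the example replicate reported in the Results, with 0.001 standing in as an upper bound for "P value < 0.001".

```python
def is_causal(coef, p_value=None, coef_cutoff=0.05, alpha=0.05):
    """Classify the exposure as increasing risk (True) or having no effect (False)."""
    if coef <= coef_cutoff:
        return False
    # Assumption: when no P value is available, only the coefficient cutoff is used.
    return p_value is None or p_value < alpha


def alternative_estimate(b_conventional, b_proxy):
    """Conventional coefficient minus the coefficient from the proxy-outcome model."""
    return b_conventional - b_proxy


# First outcome of interest: fact model 1.27, conventional 0.79, proxy 0.09.
print(is_causal(1.27, 0.001), is_causal(0.79, 0.001),
      is_causal(alternative_estimate(0.79, 0.09)))
# Second outcome of interest: fact model -0.024 (P = 0.668), conventional 0.12, proxy 0.16.
print(is_causal(-0.024, 0.668), is_causal(0.12, 0.001),
      is_causal(alternative_estimate(0.12, 0.16)))
```

For the second outcome the conventional estimate is classified as causal while the fact model and the alternative estimate are not, which is the pattern described in the Results.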

Empirical Application.
A simple example is provided to clarify the methodology. For additional illustration of the proxy outcome method to adjust for residual confounding effects, interested readers are referred to the first author's recently published research [26]. Briefly, when investigating the effect of alcohol use (exposure) on general health status (outcome), both measured and unmeasured confounding factors are involved. Many of these confounding factors, such as socioeconomic determinants, environmental factors, lifestyle, and genetic susceptibility, are clustered within the family. Although current alcohol use by adults does not produce any physiological effect on their children's current health, the observed effect of adults' current alcohol use (exposure) on their children's health status (proxy outcome) can be used as an approximation of confounding effects.
This example used data from the 2010 National Health Interview Survey. A first logistic regression model was fitted to compare the likelihood of having undesirable (poor or fair) health status (outcome) between lifetime abstainers and current light drinkers (exposure). A second logistic regression model was then applied to compare the likelihood of undesirable health status in the children (proxy outcome) in relation to the drinking status of their parents. To adjust for confounding effects, the natural logarithm of the odds ratio from the second model was introduced as an offset variable into the first model.

Results
Estimates based on the knowledge and conventional model from one replicate are shown in Tables 1 and 2 as an example. Both outcomes $A$ and $B$ are treated as the outcomes of interest, while outcome $C$ is used as the proxy outcome. In this replicate, outcomes $A$ and $C$ share four common causal factors, which account for 33% and 40% of all causal factors for $A$ and $C$, respectively. Apart from the exposure of interest ($X_1$), 54% of the causal factors for $A$ are known. The true effect of $X_1$ on outcome $A$ ($\beta_{A,1}$) based on the fact model is 1.27 (P value < 0.001), indicating that $X_1$ is a real causal factor for $A$. The estimated effect based on the conventional approach ($b_{A,1}$) is 0.79 (P value < 0.001). The estimated effect based on the alternative approach ($\hat{b}_{A,1}$) is 0.79 − 0.09 = 0.70. Both the conventional approach and the alternative approach lead to the same correct conclusion that $X_1$ is a causal factor for $A$.

Table 1: Data example of a replicate/scenario: estimated effects (coefficients from logistic models) of the exposure ($X_1$) and the known "causal"/confounding factors of $A$ on $A$ and proxy outcome $C$.

Outcomes $B$ and $C$ also share four common factors, which account for 67% and 40% of all causal factors for $B$ and $C$, respectively. Apart from the exposure of interest ($X_1$), 33% of the causal factors for $B$ are known. The true effect of $X_1$ on outcome $B$ ($\beta_{B,1}$) based on the fact model is −0.024 (P value = 0.668), indicating that $X_1$ is not a causal factor for $B$. The estimated effect based on the conventional approach ($b_{B,1}$) is 0.12 (P value < 0.001). The estimated effect based on the alternative approach ($\hat{b}_{B,1}$) is 0.12 − 0.16 = −0.04. Therefore, based on the conventional approach, one would mistakenly conclude that "$X_1$ is a causal factor for $B$." However, given that $\hat{b}_{B,1} < 0.05$, the alternative approach leads to the correct conclusion that $X_1$ is not a causal factor for $B$. Table 3 summarises findings from the 500 replicates.
Based on the fact model, the exposure of interest ($X_1$) is classified as a causal factor for outcome $A$ in all replicates. This is in perfect agreement with the simulation process, in which $X_1$ is set to be a causal factor for $A$. In the simulation process, $X_1$ is set to be a noncausal factor for $B$; in 97.4% of the 500 replicates, the fact model concludes that $X_1$ is not a causal factor for $B$, but in 2.6% of the replicates the fact model concludes that $X_1$ is a causal factor for $B$. The disagreement between the simulation process and the fact model in these 2.6% of replicates is a result of type I error (the two-sided confidence level being set to 95%). In all replicates, both the conventional and alternative approaches classify $X_1$ as a causal factor for $A$. For outcome $B$, the conventional approach leads to the same conclusion as the fact model in only 21% of the replicates, whereas the alternative approach does so in 72.2% of the replicates. When the simulation process is used as the gold standard for classification (i.e., $X_1$ is causal for $A$, but not causal for $B$), the sensitivity of the new approach is 100%, and the specificity is 70.2%.
Table 4 presents results of the example. Alcohol use by adults has similar effects on their own health status and on their children's health status, whereas the effect on children's health status is attributable to confounding factors. To account for the uncontrolled confounding effects when estimating the effect of adults' alcohol use on their health status, an offset variable, which takes the value of the natural logarithm of 1 for lifetime abstainers and of 0.60 for current light drinkers, is added to the model. The adjustment changes the odds ratio from 0.54 (P < 0.001) to 0.90 (P = 0.38).
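The size of the offset adjustment in the example can be checked with simple arithmetic. The sketch below assumes that the offset equals ln(0.60) for current light drinkers and ln(1) = 0 for lifetime abstainers, and that the exposure indicator remains in the model; under that assumption the offset shifts the fitted log odds ratio of the exposure by exactly ln(0.60). The function name is hypothetical, and the reported odds ratios are rounded, so agreement is only expected to two decimals.

```python
import math


def offset_adjusted_odds_ratio(crude_or, proxy_or):
    """Odds ratio for the exposure after adding ln(proxy_or) as an offset
    for the exposed group (and ln(1) = 0 for the unexposed group)."""
    offset = math.log(proxy_or)
    return math.exp(math.log(crude_or) - offset)


# Unadjusted odds ratio 0.54, proxy-outcome odds ratio 0.60 (values from the example).
print(round(offset_adjusted_odds_ratio(0.54, 0.60), 2))  # prints 0.9
```

On the log scale this is the same subtraction used in the simulation: the conventional coefficient minus the coefficient from the proxy-outcome model.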

Discussion
In this study, we introduce a new analysis approach for estimating causal effects. Although the new approach is applicable only to relative effect measures (i.e., risk ratios and odds ratios), it does not require any distributional assumption about the confounding variables in relation to the outcome, the exposure, and other known confounding variables. Instead, the approach merely assumes that the causes of the outcome of interest and of the proxy outcome partly overlap and are correlated. An optimal proxy outcome can be chosen by directly applying field expertise, without advanced knowledge of statistics. The simulation results show that the alternative approach is far more accurate than the conventional approach in classifying causal associations, even under conditions of low to moderate correlation between the causes of the outcome and the causes of the proxy outcome. The proposed approach appears to be suitable for observational studies in social science and health research that evaluate the health impact of behaviour and mental health problems, fields in which clusters of causes for various outcomes are strongly correlated and overlap [12,27,28].
It should be noted that the analysis can only be performed when effects are measured on a relative scale, such as the risk ratio or odds ratio. Another limitation is that false classification remains possible, even though the proposed method appears to have an advantage over the conventional approach. In this study, we also demonstrate a new simulation process that incorporates component causes, competing events, and the difference between the fact and the knowledge, in order to model realistic scenarios in observational studies. This simulation process could be further developed and used to determine how knowledge that deviates from the fact can introduce bias in estimates.

Conclusion
In conclusion, the proposed proxy outcome approach can be applied in observational studies in social science and health research that evaluate the health impact of behaviour and mental health problems.