Epidemiological studies often utilize stratified data in which rare outcomes or exposures are artificially enriched. This design can increase precision in association tests but distorts predictions when classifiers are applied to non-stratified data. Several methods correct for this so-called sample selection bias, but their performance remains unclear, especially for machine learning classifiers. With an emphasis on two-phase case-control studies, we aim to assess which corrections to perform in which setting and to obtain methods suitable for machine learning techniques, especially the random forest. We propose two new resampling-based methods that resemble the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging. We compare all techniques for the random forest and other classifiers, both theoretically and on simulated and real data. Empirical results show that the random forest profits only from the parametric inverse-probability bagging proposed by us. For other classifiers, correction is mostly advantageous, and the methods perform uniformly. We discuss consequences of inappropriate distribution assumptions and give reasons for the different behaviors of the random forest and other classifiers. In conclusion, we provide guidance for choosing correction methods when training classifiers on biased samples. For random forests, our method outperforms state-of-the-art procedures if distribution assumptions are roughly fulfilled. We provide our implementation in the R package
Statistics is an art of inferring information about large populations from comparably small random samples. This is necessary because in practice it is most often impossible to obtain measurements from all individuals in a population (e.g., due to organizational or cost reasons). In the clinical context, for example, one might aim to predict the risk of a certain disease for an entire population based on clinical features. The risk model will be derived from information from a much smaller random subsample of the population. When building such models, a common assumption is that the subsample follows the same distribution as the population it was taken from. This assumption, however, is not valid if the sample is not taken at random. In the epidemiological context, for example, this case occurs in the well-known
An even more complex sample design appears in
Stratified random selection process of a two-phase case-control study. The characteristics known for the whole finite population are typically features that are inexpensive to measure; they are recorded in Phase 1. The expensive characteristics are recorded only in Phase 2—in the final sample
Exemplary cross table for data before (left) and after (right) the selection process of a two-phase case-control study. There is a clear dependency between exposure and disease in the population. After the sampling process, this dependency vanishes completely in the final sample
In the situations described above, the sample follows a different distribution than the population. This can affect statistical analysis. In the general context, this issue is known as
This paper assesses, proposes, and compares approaches to correct for sample selection bias in complex surveys, especially in two-phase case-control studies. We therefore focus on the binary outcome. Figure
Scheme of learning on biased learning data and predicting on unbiased test data. The classifier learns on four equally sized strata (the complete biased learning data set) but predicts on a data set (the unbiased population) in which the four strata have different sizes.
This paper is structured as follows. We formalize sample selection bias and address the necessity of correction in Section
This section introduces general definitions and background information: a formal description of sample selection bias (Section
The following setup is similar to that of Zadrozny [
For the setup of the sample selection bias issue, let in addition
In general, a sample
Label bias: biasedness depends on
Feature bias: biasedness depends on
Complete bias: biasedness depends on
Under label bias,
Whenever there is sample selection bias, there are
In this paper, we will discuss the special case of two-phase case-control studies and hence put them into the context of sample selection bias in this subsection.
The case-control study is an example of sample selection bias in the clinical context: some diseases under investigation are very rare in the entire population. A random sample of study participants would contain very few cases of the disease. Statistical analysis would suffer from low precision and thus low power. In order to increase precision and power, the number of cases is enriched such that the sample contains equal proportions of cases and controls. In particular,
Case-control studies are mostly used for investigating associations between disease and features. The underlying label bias does not alter the effect estimates in hypothesis tests for associations between disease and features. However, this is true only asymptotically, and there may be consequences in small-sample scenarios. If one focuses on prediction, for example, via logistic regression, as we do in this paper, the intercept estimate can simply be adjusted as described in Rose and van der Laan [
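The intercept adjustment mentioned above is often called prior correction: the slope estimates are kept, and only the intercept is shifted by the log-odds difference between the sample's and the population's case proportions. A minimal sketch (function name and example values are ours, not from the cited work):

```python
import math

def adjust_intercept(beta0_sample, sample_case_rate, population_case_rate):
    """Prior correction for a logistic regression intercept fitted on
    case-enriched data: shift the intercept by logit(sample case rate)
    minus logit(population case rate); the slopes stay unchanged."""
    offset = math.log(sample_case_rate / (1 - sample_case_rate)) \
           - math.log(population_case_rate / (1 - population_case_rate))
    return beta0_sample - offset

# Hypothetical example: a 50/50 case-control sample drawn from a
# population with 1% disease prevalence.
beta0_adjusted = adjust_intercept(0.0, 0.5, 0.01)
```

With a balanced sample and 1% prevalence, the adjusted intercept is pushed strongly downwards, reflecting how much rarer cases are in the population than in the enriched sample.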
In
When data is sampled as in one-phase or two-phase case-control studies, there are groups within which the selection probabilities are equal. These groups are called
For a population of size
If the features determining the selection probabilities are categorical, the data set can be partitioned into corresponding strata with equal selection probabilities. This is not the case if, for example, the feature causing the selection bias is continuous. In the categorical case, selection probabilities can be used for adjusting the distribution of the sample to the original distribution of the population.
Consider the selection probability
In our correction approaches, we will use
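Within each stratum, the selection probability is the fraction of the stratum's population that made it into the sample, and the IP weight is its reciprocal. A small sketch (the stratum names and counts are hypothetical):

```python
def ip_weights(population_counts, sample_counts):
    """Inverse-probability weight per stratum: the selection probability of
    stratum s is n_s / N_s (sampled over population count), so its IP
    weight is N_s / n_s."""
    return {s: population_counts[s] / sample_counts[s] for s in sample_counts}

# Hypothetical 2x2 case-control strata (disease status x exposure):
w = ip_weights(
    {"case/exp": 50, "case/unexp": 50, "ctrl/exp": 900, "ctrl/unexp": 9000},
    {"case/exp": 50, "case/unexp": 50, "ctrl/exp": 50, "ctrl/unexp": 50},
)
# w["ctrl/unexp"] == 180.0: each sampled unexposed control stands in
# for 180 individuals of the population.
```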
In this section, we describe, modify, and analyze IP-weight-incorporating variants of classifiers that are designed for learning on an unbiased data set, for the case where only a biased data set for learning is given.
All approaches adjust the given data set to correct for sample selection bias by reconstructing the original (unbiased) data structure before or while learning the classifier. We consider the classifier
The methods in this section were proposed in the literature and are partly modified for our purposes.
A drawback of costing in the case of strata with a low number of observations is the following: there may be subsamples that do not contain observations from all strata, which implies that no classification rule can be learnt for the missing strata from those subsamples. For the purposes of this paper, we adjusted the costing algorithm by discarding such incomplete subsamples. This modification causes a bias which we consider negligible.
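The adjusted procedure can be sketched as follows: each subsample is drawn by rejection sampling (observation i is kept with probability proportional to its weight), and subsamples missing a whole stratum are thrown away. Function and variable names are ours, purely illustrative:

```python
import random

def costing_subsamples(data, weights, strata, n_subsamples, seed=0):
    """Costing via rejection sampling: observation i enters a subsample with
    probability w_i / max(w). Our modification: subsamples that do not
    contain every stratum are discarded rather than used for training."""
    rng = random.Random(seed)
    w_max = max(weights)
    all_strata = set(strata)
    subsamples = []
    while len(subsamples) < n_subsamples:
        keep = [i for i, w in enumerate(weights) if rng.random() < w / w_max]
        if {strata[i] for i in keep} == all_strata:  # reject incomplete samples
            subsamples.append([data[i] for i in keep])
    return subsamples
```

In the full costing scheme, one classifier is then trained per accepted subsample and the resulting models are averaged.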
In its original form, SMOTE generates synthetic observations for the minority class as follows: For fixed
We adapt SMOTE to the context of stratified random samples: Rather than enlarging only the minority class, we generate synthetic observations for all strata with
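The per-stratum synthetic generation described above can be sketched as follows; this is an illustrative numpy version (function name, neighbour count, and defaults are ours), applied independently within each stratum that needs enlarging:

```python
import numpy as np

def smote_stratum(X, k=5, n_new=10, rng=None):
    """SMOTE-style synthetic points within one stratum: pick an observation,
    pick one of its k nearest neighbours inside the stratum, and interpolate
    at a uniformly random position on the connecting line segment."""
    rng = np.random.default_rng(rng)
    n = len(X)
    synth = []
    for _ in range(n_new):
        i = rng.integers(n)
        d = np.linalg.norm(X - X[i], axis=1)
        nn = np.argsort(d)[1:k + 1]          # k nearest neighbours, excluding i itself
        j = rng.choice(nn)
        lam = rng.random()                   # position on the segment
        synth.append(X[i] + lam * (X[j] - X[i]))
    return np.array(synth)
```

Because every synthetic point lies on a segment between two observed points, the generated sample stays inside the convex hull of the stratum's data.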
The approaches above aim to reconstruct the original data distribution in order to then learn a classifier on an unbiased sample. However, several aspects are not incorporated so far: IP oversampling replicates observations and thereby biases the covariance structure within the strata. A correction for this bias should be provided. Similarly, modified SMOTE biases the data, especially for large weights
In this section, we propose two procedures which aim to overcome the problem of small strata by increasing the number of observations per stratum while at the same time estimating the covariance of the population appropriately. The idea behind both approaches is to exploit the fact that within each stratum
Let
When adding this noise, we want to retain important distribution characteristics of the respective stratum. As stated above, the stratified sample contains features
We seek two conditions to hold:
In order to make a corresponding correction method more robust, we repeat the noise-adding procedure and average over the models fitted on each of those repetitions. Algorithm
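A compact sketch of parametric IP bagging under a per-stratum multivariate normal assumption (all names are ours, not the package's): estimate the mean and covariance within each stratum, draw a parametric bootstrap sample of the stratum's population size, fit the classifier on the rebuilt unbiased sample, and repeat.

```python
import numpy as np

def parametric_ip_bagging(strata, pop_sizes, fit, n_rep=25, seed=0):
    """Parametric IP bagging sketch. `strata` maps a stratum name to (X, y),
    with X the stratum's feature matrix and y its class label; `pop_sizes`
    gives each stratum's size in the population. Per repetition, every
    stratum is rebuilt at population scale by a multivariate normal
    parametric bootstrap, and one model is fitted on the rebuilt sample.
    The returned models are averaged at prediction time."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_rep):
        X_parts, y_parts = [], []
        for name, (X, y) in strata.items():
            mu = X.mean(axis=0)
            sigma = np.cov(X, rowvar=False)   # within-stratum covariance estimate
            draw = rng.multivariate_normal(mu, sigma, size=pop_sizes[name])
            X_parts.append(draw)
            y_parts.append(np.full(pop_sizes[name], y))
        models.append(fit(np.vstack(X_parts), np.concatenate(y_parts)))
    return models
```

Drawing fresh observations from the fitted distribution, rather than replicating observed ones, is what keeps the within-stratum covariance structure (approximately) unbiased.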
So far, we described seven ways to deal with sample selection bias: no correction, IP oversampling, IP bagging, costing, modified SMOTE, stochastic IP oversampling, and parametric IP bagging. This subsection compares their characteristics. They are summarized in the left part of Table
Properties and performance of correction approaches for logistic regression and random forest. The properties are as follows: (i) a correction attempt is made at all; (ii) the covariance structure of the learning data is attempted to be unbiased; (iii) learning is based on a data set containing a larger number
Correction approach          Properties according to Section        Sufficient performance
                             (i)     (ii)     (iii)                  Logistic regression     Random forest
No correction
IP oversampling              ✓                ✓                      ✓
IP bagging                   ✓       ✓                               ✓
Costing                      ✓       ✓                               (✓)
Modified SMOTE               ✓       (✓)      ✓                      (✓)
Stochastic IP oversampling   ✓       ✓        ✓                      ✓
Parametric IP bagging        ✓       ✓        ✓                      ✓                       ✓
In Sections
As described by Zadrozny [
We investigate two variants of this model: first, all features enter the model only linearly; in a refinement, all possible two-way interaction terms are additionally included, not only in order to detect possible interaction effects but also to obtain more complex decision boundaries.
A bootstrap sample is drawn from the given learning data set.
A decision tree is grown by recursively applying binary splits to the given data based on the features.
At each node only a subset of features is selected at random.
Steps (1) to (3) are repeated and all trees are averaged; class probabilities can be estimated as the relative frequency of the class of interest for a terminal node.
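Steps (1) to (4) can be illustrated with a deliberately minimal ensemble of one-split stumps (full random forests grow deep trees and draw a feature subset at every node; all names here are ours):

```python
import numpy as np

def fit_stump_forest(X, y, n_trees=50, rng=None):
    """Toy illustration of the four steps with one-split stumps:
    (1) draw a bootstrap sample, (3) restrict the split to a randomly chosen
    feature, (2) apply one binary split, and store the terminal-node class-1
    frequencies. y is a 0/1 array."""
    rng = np.random.default_rng(rng)
    n, p = X.shape
    stumps = []
    for _ in range(n_trees):
        idx = rng.integers(n, size=n)                  # (1) bootstrap sample
        j = rng.integers(p)                            # (3) random feature choice
        thr = np.median(X[idx, j])                     # (2) one binary split
        left = y[idx][X[idx, j] <= thr]
        right = y[idx][X[idx, j] > thr]
        stumps.append((j, thr,
                       left.mean() if len(left) else y.mean(),
                       right.mean() if len(right) else y.mean()))
    return stumps

def predict_proba(stumps, x):
    """(4) average the terminal-node class-1 frequencies over all trees."""
    return np.mean([pl if x[j] <= thr else pr for j, thr, pl, pr in stumps])
```

Averaging relative class frequencies over many randomized trees is exactly how the forest's class probability estimates arise.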
An essential step which is different from common bagging (cf. Section
So far, we have presented and developed strategies for fitting classifiers under complete bias. In this section, we investigate their performance when a sample from a twophase casecontrol study is given as learning data set but the test data is unbiased (i.e., it is a random sample from the population). We do this in a simulation study. After stating the setup in Section
For evaluating the performance of correction approaches on training samples from two-phase case-control studies and unbiased validation data sets, we need three kinds of data sets: first, a biased learning data set stemming from a two-phase case-control study; second, an unbiased large reference learning data set for comparison purposes (we refer to this data as
We started by generating the large unbiased population data set. To that end, we randomly sampled
Normal distribution:
Student’s t-distribution:
Poisson distribution:
Bernoulli distribution:
The distribution parameters were uniformly drawn from the following sets for
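A generic sketch of drawing one feature column from each of the four families above; the parameter values passed in are placeholders, since the paper's actual parameter ranges are not reproduced here:

```python
import numpy as np

def simulate_feature(n, dist, params, seed=0):
    """Draw n observations of one feature from the named distribution family.
    `params` holds placeholder parameters (not the paper's ranges)."""
    rng = np.random.default_rng(seed)
    if dist == "normal":
        return rng.normal(params["mean"], params["sd"], size=n)
    if dist == "t":
        return rng.standard_t(params["df"], size=n)
    if dist == "poisson":
        return rng.poisson(params["lam"], size=n)
    if dist == "bernoulli":
        return rng.binomial(1, params["p"], size=n)
    raise ValueError(f"unknown distribution family: {dist}")
```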
In order to also investigate more realistic distribution scenarios, we additionally generated and analyzed data sets with dependent features and features from different distributions. These studies yield similar results as the setting above and are described in the Supplementary Material of this paper (available online at
Given the covariates
In order to obtain a biased stratified sample, we simulated a two-phase random selection process from the population (Figure
Test data sets of size
In fact, the four distribution scenarios meet the Gaussian assumption in decreasing order: the normal distribution trivially fulfills it. The t-distribution is still continuous and symmetric, so the violation of the normality assumption may not be too severe. The Poisson distribution is discrete but approximately normal for
The goal of the comparison is to see whether correction approaches perform significantly better than no correction. For each classifier, we fit a linear regression model with the AUC as target variable and the correction approach as covariate. The latter variable is dummy-coded with “no correction” as reference category. An approach is determined to differ significantly from the non-correction approach if its coefficient’s
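The dummy-coded comparison can be sketched as follows; this minimal OLS version (names are ours) returns each method's coefficient, i.e., its difference in mean AUC relative to no correction, and omits the significance test applied in the paper:

```python
import numpy as np

def compare_to_no_correction(auc, method):
    """Fit AUC ~ method by ordinary least squares with dummy coding and
    'no correction' as the reference category. The intercept is the mean
    AUC without correction; each coefficient is a method's mean AUC
    difference from that baseline."""
    levels = sorted(set(method) - {"no correction"})
    X = np.column_stack(
        [np.ones(len(auc))]
        + [(np.array(method) == lv).astype(float) for lv in levels]
    )
    beta, *_ = np.linalg.lstsq(X, np.array(auc), rcond=None)
    return {"intercept": beta[0], **dict(zip(levels, beta[1:]))}
```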
The simulation study yielded the following results (see also Figures
Performance of correction approaches in logistic regression, measured by AUC. We fit a linear model for the AUC as influenced by the correction method (dummy-coded, no correction as reference category). The graphic depicts 95% confidence intervals for the respective coefficients. The dashed line shows the intercept of the model (i.e., the mean AUC for no correction). The blue-colored methods are newly proposed in this paper.
Performance of correction approaches in the random forest, measured by AUC. We fit a linear model for the AUC as influenced by the correction method (dummy-coded, no correction as reference category). The graphic depicts 95% confidence intervals for the respective coefficients. The dashed line shows the intercept of the model (i.e., the mean AUC for no correction). The blue-colored methods are newly proposed in this paper.
Performance of correction approaches in logistic regression with additional two-way interaction terms, measured by AUC. We fit a linear model for the AUC as influenced by the correction method (dummy-coded, no correction as reference category). The graphic depicts 95% confidence intervals for the respective coefficients. The dashed line shows the intercept of the model (i.e., the mean AUC for no correction). The blue-colored methods are newly proposed in this paper.
Performance of correction approaches in the naive Bayes classifier, measured by AUC. We fit a linear model for the AUC as influenced by the correction method (dummy-coded, no correction as reference category). The graphic depicts 95% confidence intervals for the respective coefficients. The dashed line shows the intercept of the model (i.e., the mean AUC for no correction). The blue-colored methods are newly proposed in this paper.
However, there were differences between classifiers concerning the success of correction approaches. We start by contrasting logistic regression and the random forest as this comparison is of our primary interest.
The overall result for logistic regression (Figure
For the random forest, the picture is rather different (Figure
Table
In order to obtain a more comprehensive picture of the benefit of correcting for sample selection bias, we applied the correction methods in combination with two more classifiers, logistic regression with two-way interaction terms in addition to the linear terms and naive Bayes, leading to the following results.
Logistic regression with interaction terms yields a similar picture as standard logistic regression (Figure
For naive Bayes (Figure
This section investigates the performance of the correction methods in a real data example. Unlike in the synthetic-data situation of the previous section, we do not know the true distribution of the entire population here. In order to still evaluate the predictions appropriately, we chose a very large real data set from which we could extract a small stratified learning set and a large unbiased test set, as described in the following.
Normal quantile-quantile plots for main features
Cross table for the
We trained all methods on the biased learning data and evaluated them on the unbiased test data. The resulting AUCs are compared by seven pairwise hypothesis tests according to [
The real data results confirm the findings from the simulation study. For logistic regression, all weighting approaches perform very similarly and significantly better than the non-weighting approach, and even comparably to learning on a large population (Figure
Performance of logistic regression on real data. The graphic depicts 95% confidence intervals for the respective AUC values, calculated on the basis of [
Performance of the random forest on real data. The graphic depicts 95% confidence intervals for the respective AUC values, calculated on the basis of [
Performance of logistic regression with all two-way interaction terms on real data. The graphic depicts 95% confidence intervals for the respective AUC values, calculated on the basis of [
Performance of naive Bayes on real data. The graphic depicts 95% confidence intervals for the respective AUC values, calculated on the basis of [
For random forest, we obtain similar results as in the simulation study (Figure
Also for logistic regression with interaction terms and naive Bayes, we obtain results matching the simulation study: the normality assumptions are met only roughly for the real data, in which case the correction approaches all perform similarly and better than no correction (Figure
We investigated how to learn classifiers on stratified random samples as resulting from two-phase case-control studies. Here, our emphasis was on random forest classification since previous bias correction methods did not pay special attention to resampling-based classifiers. However, we studied a broad range of classification techniques; this work hence guides the choice of such approaches for other classifiers as well. The methods are immediately applicable due to the implementations provided in our R package
Both our simulation study and the real data application show that, for classifiers trained on biased data sets, prediction on unbiased data sets can be improved if the stratification process is taken into account and corrected for. However, state-of-the-art correction approaches from classical statistics (IP oversampling, IP bagging, costing, and modified SMOTE) do not yield the desired improvement for random forests. In fact, they can even lead to worse AUC values than those obtained without any correction. Of our two proposed approaches (stochastic IP oversampling and parametric IP bagging), on the other hand, the latter consistently outperformed the non-correction approach.
We were also interested in the correction approaches’ success when employed in the context of logistic regression. It turned out that every method improves prediction on an independent data set as compared to no correction, and all correction techniques perform similarly.
Table
Having compared correction methods in random forests and in logistic regression, one may conclude that parametric IP bagging is advisable whenever the distribution assumptions for this approach are met. In order to scrutinize this conclusion once more, we investigated the behavior of all correction approaches in two more classifiers, a logistic regression model with additional interaction terms and the naive Bayes classifier. For the logistic regression model with interaction terms, once again only parametric IP bagging consistently outperformed the non-correction approach. For naive Bayes, all approaches performed similarly to each other, confirming the above-stated rule.
Contrary to our expectations, naive Bayes failed in the simulation study for the normal distribution scenario but did well for all other distributions. A generally unexpected result was the poor performance of stochastic IP oversampling: it performed worse than no correction in several scenarios and was successful only in those situations where all other correction approaches were successful as well.
For a random forest, parametric IP bagging is an effective technique for prediction on an unbiased data set and can also be preferred for other classifiers. However, in this paper, we restricted our simulations and real data example to the case where the main features could be assumed to be roughly normally distributed (after transformation, if necessary) so that the assumption of a multivariate normal distribution was appropriate. The success of parametric IP bagging generally depends on meeting the assumptions about the distributions of the features. Hence, the method should be chosen with care. On the other hand, our simulations show that, even in scenarios where assumptions are barely met (e.g., for Poisson distributed features), the approach still works. Clearly, one could also adjust the distribution family for the parametric bootstrap in parametric IP bagging. Even mixture distributions are conceivable (e.g., for bimodal feature distributions).
So far, parametric IP bagging has not been designed for binary or categorical main features or combinations of different types. This could be done by subgrouping the corresponding categories (or combining categories in the case of several categorical features) and estimating parameters in each of the subgroups for the assumed distribution family analogously to what we did for the different strata. Again, one would draw parametric bootstrap samples within all subgroups and construct a new unbiased sample within the scope of parametric IP bagging.
Even though our new approaches were developed for the random forest, they are not tied to it and can be incorporated into other machine learning algorithms. Parametric IP bagging has been shown to perform well even if theoretical assumptions are not met. It can be applied to any stratified random sample and is not restricted to two-phase case-control studies. More generally, it is suited for any sample suffering from sample selection bias where the stratum features are categorical and the remaining features roughly follow a multivariate distribution from which parametric bootstrap samples can be drawn. For general classifiers, its performance is mostly comparable to that of other correction methods. Parametric IP bagging is the first correction method designed for the random forest and in that context clearly outperforms all other approaches.
Label bias does not imply that
Let
Analogously, one can show that feature bias does not imply that
Here, we derive an appropriate noise covariance matrix to be added to the features
For one stratum
For IP oversampling, we replicate the data points by the factor
We can simplify
We can estimate the components of the covariance matrix
In terms of random variables, the empirical covariance matrix combining all entries
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Christiane Fuchs and Fabian J. Theis are supported by the German Research Foundation (DFG) within the Collaborative Research Centre 1243, Subproject A17.