Down-sampling in Logistic Regression
This post is a repost of a question I asked on Cross Validated back in 2013. The answer is the foundation of my own understanding and beliefs about down-sampling in logistic regression and machine learning in general.
The Question
If I have a dataset with a very rare positive class, and I down-sample the negative class, then perform a logistic regression, do I need to adjust the regression coefficients to reflect the fact that I changed the prevalence of the positive class?
For example, let’s say I have a dataset with 4 variables: Y, A, B, and C. Y, A, and B are binary and C is continuous. Y=0 for 11,100 observations and Y=1 for 900:
set.seed(42)
n <- 12000                          # total number of observations
r <- 1/12                           # not used below
A <- sample(0:1, n, replace=TRUE)   # binary predictor
B <- sample(0:1, n, replace=TRUE)   # binary predictor
C <- rnorm(n)                       # continuous predictor
Y <- ifelse(10 * A + 0.5 * B + 5 * C + rnorm(n)/10 > -5, 0, 1)  # rare positive class
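A quick sanity check of the class balance (not part of the original question):
table(Y)   # roughly 11,100 zeros and 900 ones with this seed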
I fit a logistic regression to predict Y, given A, B and C.
dat1 <- data.frame(Y, A, B, C)
mod1 <- glm(Y~., dat1, family=binomial)
However, to save time I could remove 10,200 of the Y=0 observations, leaving 900 Y=0 and 900 Y=1:
require('caret')
# downSample() keeps all 900 Y=1 rows and randomly samples 900 Y=0 rows;
# the outcome is returned in a column named "Class"
dat2 <- downSample(data.frame(A, B, C), factor(Y), list=FALSE)
mod2 <- glm(Class~., dat2, family=binomial)
The regression coefficients from the two models look very similar:
coef(summary(mod1))
#                Estimate Std. Error   z value     Pr(>|z|)
# (Intercept) -127.67782  20.619858 -6.191983 5.941186e-10
# A           -257.20668  41.650386 -6.175373 6.600728e-10
# B            -13.20966   2.231606 -5.919353 3.232109e-09
# C           -127.73597  20.630541 -6.191596 5.955818e-10
coef(summary(mod2))
#                Estimate  Std. Error     z value    Pr(>|z|)
# (Intercept) -167.90178   59.126511 -2.83970391 0.004515542
# A           -246.59975 4059.733845 -0.06074284 0.951564016
# B            -16.93093    5.861286 -2.88860377 0.003869563
# C           -170.18735   59.516021 -2.85952165 0.004242805
This leads me to believe that the down-sampling did not affect the coefficients, but it is a single, contrived example and I’d rather know for sure.
The Answer (by Scortchi)
Down-sampling is equivalent to case–control designs in medical statistics—you’re fixing the counts of responses & observing the covariate patterns (predictors). Perhaps the key reference is Prentice & Pyke (1979), “Logistic Disease Incidence Models and Case–Control Studies”, Biometrika, 66, 3.
They used Bayes’ Theorem to rewrite each term in the likelihood for the probability of a given covariate pattern conditional on being a case or control as two factors; one representing an ordinary logistic regression (probability of being a case or control conditional on a covariate pattern), & the other representing the marginal probability of the covariate pattern. They showed that maximizing the overall likelihood subject to the constraint that the marginal probabilities of being a case or control are fixed by the sampling scheme gives the same odds ratio estimates as maximizing the first factor without a constraint (i.e. carrying out an ordinary logistic regression).
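In symbols (a sketch of the idea, not their full argument), Bayes’ Theorem turns each retrospective term into a prospective logistic term times marginal terms:

$$\prod_i P(x_i \mid y_i) \;=\; \prod_i \frac{P(y_i \mid x_i)}{P(y_i)} \;\times\; \prod_i P(x_i),$$

where $P(y_i \mid x_i)$ is the ordinary logistic regression model, the $P(y_i)$ are fixed by the sampling scheme, and the last product involves only the marginal distribution of the covariate patterns.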
The intercept for the population, $\beta_0^*$, can be estimated from the case–control intercept $\hat{\beta}_0$ if the population prevalence $\pi$ is known:

$$\hat{\beta}_0^* = \hat{\beta}_0 + \log\left(\frac{\pi}{1-\pi}\right) - \log\left(\frac{n_1}{n_0}\right)$$

where $n_0$ & $n_1$ are the number of controls & cases sampled, respectively.
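As a concrete sketch, here is that correction applied to the toy example from the question; the names pi_hat, n1, n0, and b0_pop are mine, and the prevalence is simply taken from the full data set rather than assumed known from elsewhere:

pi_hat <- mean(dat1$Y == 1)        # population prevalence of Y = 1
n1 <- sum(dat2$Class == "1")       # cases in the down-sampled data
n0 <- sum(dat2$Class == "0")       # controls in the down-sampled data (here n1 == n0)
b0_pop <- coef(mod2)[["(Intercept)"]] +
  log(pi_hat / (1 - pi_hat)) - log(n1 / n0)
b0_pop                             # intercept re-scaled to the population prevalence

The slope coefficients need no such adjustment; only the intercept is shifted by the sampling scheme.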
Of course, by throwing away data you’ve gone to the trouble of collecting (albeit the least useful part), you’re reducing the precision of your estimates. Constraints on computational resources are the only good reason I know of for doing this, but I mention it because some people seem to think that “a balanced data-set” is important for some other reason I’ve never been able to ascertain.