Logistic regression for presence-absence data

regression

GLM

species distribution

Model species presence and absence with a binomial GLM in R: reading log-odds and odds ratios, fitting proportion data, and spotting the separation trap.

Author

Tidy Ecology

Published

2026-06-25

A previous post worked through count data, where the response is a non-negative integer and the natural model is a Poisson or negative binomial GLM. A lot of field data is not counts but yes-or-no: a species is present or absent at a site, a seed germinates or it does not, a quadrat is occupied or empty. That binary response needs a different member of the same GLM family, the binomial model with a logit link, and it is the workhorse behind presence-absence species distribution models (Guisan and Zimmermann 2000).

This post covers the three things that trip people up: reading the coefficients on the right scale, handling proportion data without resorting to a transformation, and recognising complete separation when it wrecks a fit. Everything runs in base R.

Presence along a gradient

Build a simple example: 200 sites along an elevation gradient, with presence probability rising as elevation increases.

set.seed(38)
n <- 200
elev <- runif(n, 500, 2000)              # metres
eta  <- -8 + 0.006 * elev                # the true log-odds
present <- rbinom(n, 1, plogis(eta))     # plogis() is the inverse logit
c(prevalence = round(mean(present), 2),
  elev_min = round(min(elev)), elev_max = round(max(elev)))

prevalence   elev_min   elev_max 
      0.46     523.00    2000.00

The response is present, a vector of zeros and ones. Fit it with glm and the binomial family, which uses the logit link by default.

m1 <- glm(present ~ elev, family = binomial)
round(summary(m1)$coefficients, 5)

            Estimate Std. Error  z value Pr(>|z|)
(Intercept) -8.74021    1.20100 -7.27744        0
elev         0.00665    0.00091  7.31023        0

Reading the coefficients

The slope is 0.00665, and it lives on the log-odds scale, which is not where most people want to think. Three transformations move between the scales a binomial GLM speaks. The raw coefficient is the change in log-odds per unit of the predictor. Its exponential is an odds ratio. And plogis of the linear predictor is a probability.

b <- coef(m1)["elev"]; se <- summary(m1)$coefficients["elev","Std. Error"]
# odds ratio for a 100 m increase, with a 95% interval
c(OR_per_100m = exp(100 * b),
  lo = exp(100 * (b - 1.96 * se)), hi = exp(100 * (b + 1.96 * se)))

OR_per_100m.elev          lo.elev          hi.elev 
        1.944396         1.626884         2.323876

# fitted presence probability at the ends of the gradient
plogis(predict(m1, data.frame(elev = c(523, 2000))))

          1           2 
0.005155522 0.989634096

Each additional 100 m of elevation multiplies the odds of presence by about 1.94, with a 95% interval of 1.63 to 2.32. Nearly a doubling of the odds per 100 m. The probability statement is different in kind: presence rises from about 0.005 at the bottom of the gradient to 0.990 at the top, but it does not rise linearly. The logit link makes the odds change by a constant factor everywhere, while the probability changes fastest in the middle of the curve and flattens at both ends. That non-linearity is the whole point of the link function, and it is why an odds ratio is a cleaner summary than any single statement about probability.

How good is the fit?

A binomial GLM has no residual sum of squares, so there is no ordinary R-squared. Two honest summaries are deviance-based pseudo-R-squared and a measure of how well the model ranks presences above absences.

mcfadden <- 1 - m1$deviance / m1$null.deviance
ph <- predict(m1, type = "response")
auc <- mean(outer(ph[present == 1], ph[present == 0], ">"))
c(null_deviance = round(m1$null.deviance, 1),
  residual_deviance = round(m1$deviance, 1),
  McFadden_R2 = round(mcfadden, 3), AUC = round(auc, 3))

    null_deviance residual_deviance       McFadden_R2               AUC 
          275.600           135.500             0.508             0.924

McFadden’s pseudo-R-squared is 0.508 here. It is not comparable to an ordinary R-squared, and values between 0.2 and 0.4 already indicate a good logistic fit, so a number this high reflects a strong, clean gradient. The AUC of 0.924 says that if you draw one occupied and one empty site at random, the model assigns the occupied one a higher probability 92% of the time. Both are worth reporting; neither should be read as if the model were a linear regression.

library(ggplot2)
pap <- "#f5f4ee"; ink <- "#16241d"; line <- "#dad9ca"
forest <- "#275139"; sage <- "#93a87f"; accent <- "#b5534e"
te_theme <- theme_minimal(base_size = 12) + theme(
  plot.background = element_rect(fill = pap, colour = NA),
  panel.background = element_rect(fill = pap, colour = NA),
  panel.grid.minor = element_blank(),
  panel.grid.major = element_line(colour = line, linewidth = .3),
  axis.title = element_text(colour = ink), axis.text = element_text(colour = "#5d6b61"),
  plot.title = element_text(colour = ink, face = "bold"),
  plot.subtitle = element_text(colour = "#5d6b61"),
  legend.title = element_text(colour = ink), legend.text = element_text(colour = "#5d6b61"))

xx <- data.frame(elev = seq(min(elev), max(elev), length = 200))
pr <- predict(m1, xx, type = "link", se.fit = TRUE)
xx$fit <- plogis(pr$fit)
xx$lo  <- plogis(pr$fit - 1.96 * pr$se.fit)
xx$hi  <- plogis(pr$fit + 1.96 * pr$se.fit)

ggplot() +
  geom_ribbon(data = xx, aes(elev, ymin = lo, ymax = hi), fill = sage, alpha = .3) +
  geom_line(data = xx, aes(elev, fit), colour = forest, linewidth = 1.1) +
  geom_point(data = data.frame(elev, present), aes(elev, present),
             position = position_jitter(height = .03), alpha = .4, colour = ink, size = 1.3) +
  labs(x = "Elevation (m)", y = "Presence probability",
       title = "A binomial GLM fits presence probability") +
  te_theme

Presence points cluster at high elevation and absence points at low elevation; a green S-shaped curve rises from near zero to near one across the gradient, with a shaded confidence band. — Figure 1: Observed presence and absence (jittered) along the elevation gradient, with the fitted logistic curve and its 95% interval.

Proportion data

The binomial model is not only for single zeros and ones. If you visit each site and record how many of several quadrats are occupied, the response is a proportion, and the same family handles it. Pass the counts as a two-column matrix of successes and failures rather than as a raw fraction.

set.seed(74)
ns <- 120
moisture <- runif(ns, 0, 1)
nq <- sample(5:20, ns, replace = TRUE)        # quadrats per site, unequal
k  <- rbinom(ns, nq, plogis(-2 + 4 * moisture))

m2 <- glm(cbind(k, nq - k) ~ moisture, family = binomial)
round(summary(m2)$coefficients, 3)

            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -2.053      0.127 -16.185        0
moisture       3.993      0.225  17.739        0

The fit recovers the generating values closely (intercept near -2, slope near 4). The two-column response matters because it carries the sample size: a site with 20 quadrats occupied 18 times is stronger evidence than a site with 5 quadrats occupied 4 times, even though both proportions are similar, and the model weights them accordingly. Collapsing to a raw fraction throws that away.

# the wrong way: a linear model on raw proportions
round(coef(lm(I(k / nq) ~ moisture))[2], 3)

moisture 
   0.842

A linear model on the raw proportions returns a slope of 0.842, which is not even on the same scale as the GLM coefficient and ignores both the unequal sample sizes and the bounded 0-to-1 response. The older habit of an arcsine square-root transformation has the same interpretability problem and can produce predictions outside the valid range; for binomial data, logistic regression has both clearer meaning and more power, and the transformation is best treated as a historical relic (Warton and Hui 2011). If the proportions show more scatter than the binomial allows, that is overdispersion, and the fix is a random effect, which is the subject of the GLMM post rather than a reason to transform.

xx2 <- data.frame(moisture = seq(0, 1, length = 200))
pr2 <- predict(m2, xx2, type = "link", se.fit = TRUE)
xx2$fit <- plogis(pr2$fit)
xx2$lo  <- plogis(pr2$fit - 1.96 * pr2$se.fit)
xx2$hi  <- plogis(pr2$fit + 1.96 * pr2$se.fit)

ggplot() +
  geom_ribbon(data = xx2, aes(moisture, ymin = lo, ymax = hi), fill = sage, alpha = .3) +
  geom_line(data = xx2, aes(moisture, fit), colour = forest, linewidth = 1.1) +
  geom_point(data = data.frame(moisture, prop = k / nq, nq),
             aes(moisture, prop, size = nq), alpha = .5, colour = ink) +
  scale_size_continuous(range = c(.8, 4), name = "quadrats") +
  labs(x = "Moisture index", y = "Proportion of quadrats occupied",
       title = "The same model handles proportion data") +
  te_theme

Scatter of occupied-quadrat proportions rising with moisture; larger points carry more quadrats, and a green logistic curve with a narrow band runs through them. — Figure 2: Proportion of quadrats occupied against moisture, with point size showing the number of quadrats per site and the fitted binomial curve.

The separation trap

One failure mode is worth recognising on sight, because the model does not always announce it clearly. If a predictor perfectly splits presences from absences, the maximum likelihood estimate does not exist: the likelihood keeps increasing as the coefficient grows without bound, so the fitting algorithm never settles.

set.seed(11)
nsep <- 40
depth <- sort(runif(nsep, 0, 5))
occ <- as.integer(depth > 2.5)            # depth predicts occurrence perfectly
msep <- glm(occ ~ depth, family = binomial)

Warning: glm.fit: algorithm did not converge

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

The warning is the first clue. The coefficient is the second.

round(summary(msep)$coefficients["depth", c("Estimate", "Std. Error")], 1)

  Estimate Std. Error 
     343.9    92792.5

range(fitted(msep))

[1] 2.220446e-16 1.000000e+00

The slope has run off to roughly 344 with a standard error in the tens of thousands, and the fitted probabilities are pinned at exactly 0 and 1. These are the fingerprints of complete separation. Albert and Anderson (1984) classified the patterns precisely: with overlap the usual estimates exist, but under complete or quasi-complete separation the finite maximum likelihood estimate does not, and the standard errors are meaningless. The number is not a strong effect; it is a non-answer.

The remedy is not to trust the output. Penalised likelihood (Firth’s method, in the logistf package) keeps the estimates finite and is the standard fix. Often the more useful response is to ask why a predictor separates the data perfectly: with a small sample it can happen by chance, and with a strong predictor it can signal that the question is really about a threshold rather than a smooth gradient.

xx3 <- data.frame(depth = seq(0, 5, length = 400))
xx3$fit <- predict(msep, xx3, type = "response")
ggplot() +
  geom_line(data = xx3, aes(depth, fit), colour = accent, linewidth = 1.1) +
  geom_point(data = data.frame(depth, occ), aes(depth, occ),
             alpha = .6, colour = ink, size = 1.8) +
  geom_vline(xintercept = 2.5, linetype = "dashed", colour = "#5d6b61") +
  labs(x = "Depth", y = "Occurrence probability",
       title = "Complete separation breaks the fit") +
  te_theme

Occurrence points are all zero below a depth of 2.5 and all one above it; the fitted red curve is a vertical step at that threshold rather than a smooth S-curve. — Figure 3: Complete separation: depth predicts occurrence perfectly, so the fitted curve collapses to a vertical step and the coefficient diverges.

Where this fits

The binomial GLM is the binary sibling of the count GLM: same iteratively reweighted least squares underneath, different link and distribution chosen to match the response. It is also the modelling end of the spatial-data workflow. The terra raster post extracts environmental values at species occurrence points; feed those values and a presence-absence column into a binomial GLM and you have a simple species distribution model. When the binary or proportion data are grouped, with repeated visits to the same sites, the independence assumption breaks and the model needs a random effect, exactly as in the GLMM post.

References

Albert & Anderson 1984 Biometrika 71(1):1-10 (10.1093/biomet/71.1.1)

Guisan & Zimmermann 2000 Ecological Modelling 135(2-3):147-186 (10.1016/S0304-3800(00)00354-9)

McCullagh & Nelder 1989 Generalized Linear Models, 2nd edition, Chapman and Hall (ISBN 0-412-31760-5)

Nelder & Wedderburn 1972 Journal of the Royal Statistical Society A 135(3):370-384 (10.2307/2344614)

Warton & Hui 2011 Ecology 92(1):3-10 (10.1890/10-0340.1)

--- title: "Logistic regression for presence-absence data" description: "Model species presence and absence with a binomial GLM in R: reading log-odds and odds ratios, fitting proportion data, and spotting the separation trap." date: "2026-06-25 11:00" categories: [regression, GLM, species distribution, R] image: thumbnail.png image-alt: "A smooth logistic S-curve rising from zero to one between two horizontal guide lines." --- A previous post worked through count data, where the response is a non-negative integer and the natural model is a Poisson or negative binomial GLM. A lot of field data is not counts but yes-or-no: a species is present or absent at a site, a seed germinates or it does not, a quadrat is occupied or empty. That binary response needs a different member of the same GLM family, the binomial model with a logit link, and it is the workhorse behind presence-absence species distribution models (Guisan and Zimmermann 2000). This post covers the three things that trip people up: reading the coefficients on the right scale, handling proportion data without resorting to a transformation, and recognising complete separation when it wrecks a fit. Everything runs in base R. ## Presence along a gradient Build a simple example: 200 sites along an elevation gradient, with presence probability rising as elevation increases. ```{r} set.seed(38) n <- 200 elev <- runif(n, 500, 2000) # metres eta <- -8 + 0.006 * elev # the true log-odds present <- rbinom(n, 1, plogis(eta)) # plogis() is the inverse logit c(prevalence = round(mean(present), 2), elev_min = round(min(elev)), elev_max = round(max(elev))) ``` The response is `present`, a vector of zeros and ones. Fit it with `glm` and the `binomial` family, which uses the logit link by default. ```{r} m1 <- glm(present ~ elev, family = binomial) round(summary(m1)$coefficients, 5) ``` ## Reading the coefficients The slope is 0.00665, and it lives on the log-odds scale, which is not where most people want to think. Three transformations move between the scales a binomial GLM speaks. The raw coefficient is the change in log-odds per unit of the predictor. Its exponential is an odds ratio. And `plogis` of the linear predictor is a probability. ```{r} b <- coef(m1)["elev"]; se <- summary(m1)$coefficients["elev","Std. Error"] # odds ratio for a 100 m increase, with a 95% interval c(OR_per_100m = exp(100 * b), lo = exp(100 * (b - 1.96 * se)), hi = exp(100 * (b + 1.96 * se))) # fitted presence probability at the ends of the gradient plogis(predict(m1, data.frame(elev = c(523, 2000)))) ``` Each additional 100 m of elevation multiplies the odds of presence by about 1.94, with a 95% interval of 1.63 to 2.32. Nearly a doubling of the odds per 100 m. The probability statement is different in kind: presence rises from about 0.005 at the bottom of the gradient to 0.990 at the top, but it does not rise linearly. The logit link makes the odds change by a constant factor everywhere, while the probability changes fastest in the middle of the curve and flattens at both ends. That non-linearity is the whole point of the link function, and it is why an odds ratio is a cleaner summary than any single statement about probability. ## How good is the fit? A binomial GLM has no residual sum of squares, so there is no ordinary R-squared. Two honest summaries are deviance-based pseudo-R-squared and a measure of how well the model ranks presences above absences. ```{r} mcfadden <- 1 - m1$deviance / m1$null.deviance ph <- predict(m1, type = "response") auc <- mean(outer(ph[present == 1], ph[present == 0], ">")) c(null_deviance = round(m1$null.deviance, 1), residual_deviance = round(m1$deviance, 1), McFadden_R2 = round(mcfadden, 3), AUC = round(auc, 3)) ``` McFadden's pseudo-R-squared is 0.508 here. It is not comparable to an ordinary R-squared, and values between 0.2 and 0.4 already indicate a good logistic fit, so a number this high reflects a strong, clean gradient. The AUC of 0.924 says that if you draw one occupied and one empty site at random, the model assigns the occupied one a higher probability 92% of the time. Both are worth reporting; neither should be read as if the model were a linear regression. ```{r} #| label: fig-logit #| fig-cap: "Observed presence and absence (jittered) along the elevation gradient, with the fitted logistic curve and its 95% interval." #| fig-alt: "Presence points cluster at high elevation and absence points at low elevation; a green S-shaped curve rises from near zero to near one across the gradient, with a shaded confidence band." #| fig-width: 8 #| fig-height: 4.8 #| message: false #| warning: false library(ggplot2) pap <- "#f5f4ee"; ink <- "#16241d"; line <- "#dad9ca" forest <- "#275139"; sage <- "#93a87f"; accent <- "#b5534e" te_theme <- theme_minimal(base_size = 12) + theme( plot.background = element_rect(fill = pap, colour = NA), panel.background = element_rect(fill = pap, colour = NA), panel.grid.minor = element_blank(), panel.grid.major = element_line(colour = line, linewidth = .3), axis.title = element_text(colour = ink), axis.text = element_text(colour = "#5d6b61"), plot.title = element_text(colour = ink, face = "bold"), plot.subtitle = element_text(colour = "#5d6b61"), legend.title = element_text(colour = ink), legend.text = element_text(colour = "#5d6b61")) xx <- data.frame(elev = seq(min(elev), max(elev), length = 200)) pr <- predict(m1, xx, type = "link", se.fit = TRUE) xx$fit <- plogis(pr$fit) xx$lo <- plogis(pr$fit - 1.96 * pr$se.fit) xx$hi <- plogis(pr$fit + 1.96 * pr$se.fit) ggplot() + geom_ribbon(data = xx, aes(elev, ymin = lo, ymax = hi), fill = sage, alpha = .3) + geom_line(data = xx, aes(elev, fit), colour = forest, linewidth = 1.1) + geom_point(data = data.frame(elev, present), aes(elev, present), position = position_jitter(height = .03), alpha = .4, colour = ink, size = 1.3) + labs(x = "Elevation (m)", y = "Presence probability", title = "A binomial GLM fits presence probability") + te_theme ``` ## Proportion data The binomial model is not only for single zeros and ones. If you visit each site and record how many of several quadrats are occupied, the response is a proportion, and the same family handles it. Pass the counts as a two-column matrix of successes and failures rather than as a raw fraction. ```{r} set.seed(74) ns <- 120 moisture <- runif(ns, 0, 1) nq <- sample(5:20, ns, replace = TRUE) # quadrats per site, unequal k <- rbinom(ns, nq, plogis(-2 + 4 * moisture)) m2 <- glm(cbind(k, nq - k) ~ moisture, family = binomial) round(summary(m2)$coefficients, 3) ``` The fit recovers the generating values closely (intercept near -2, slope near 4). The two-column response matters because it carries the sample size: a site with 20 quadrats occupied 18 times is stronger evidence than a site with 5 quadrats occupied 4 times, even though both proportions are similar, and the model weights them accordingly. Collapsing to a raw fraction throws that away. ```{r} # the wrong way: a linear model on raw proportions round(coef(lm(I(k / nq) ~ moisture))[2], 3) ``` A linear model on the raw proportions returns a slope of 0.842, which is not even on the same scale as the GLM coefficient and ignores both the unequal sample sizes and the bounded 0-to-1 response. The older habit of an arcsine square-root transformation has the same interpretability problem and can produce predictions outside the valid range; for binomial data, logistic regression has both clearer meaning and more power, and the transformation is best treated as a historical relic (Warton and Hui 2011). If the proportions show more scatter than the binomial allows, that is overdispersion, and the fix is a random effect, which is the subject of the GLMM post rather than a reason to transform. ```{r} #| label: fig-prop #| fig-cap: "Proportion of quadrats occupied against moisture, with point size showing the number of quadrats per site and the fitted binomial curve." #| fig-alt: "Scatter of occupied-quadrat proportions rising with moisture; larger points carry more quadrats, and a green logistic curve with a narrow band runs through them." #| fig-width: 8 #| fig-height: 4.8 #| message: false #| warning: false xx2 <- data.frame(moisture = seq(0, 1, length = 200)) pr2 <- predict(m2, xx2, type = "link", se.fit = TRUE) xx2$fit <- plogis(pr2$fit) xx2$lo <- plogis(pr2$fit - 1.96 * pr2$se.fit) xx2$hi <- plogis(pr2$fit + 1.96 * pr2$se.fit) ggplot() + geom_ribbon(data = xx2, aes(moisture, ymin = lo, ymax = hi), fill = sage, alpha = .3) + geom_line(data = xx2, aes(moisture, fit), colour = forest, linewidth = 1.1) + geom_point(data = data.frame(moisture, prop = k / nq, nq), aes(moisture, prop, size = nq), alpha = .5, colour = ink) + scale_size_continuous(range = c(.8, 4), name = "quadrats") + labs(x = "Moisture index", y = "Proportion of quadrats occupied", title = "The same model handles proportion data") + te_theme ``` ## The separation trap One failure mode is worth recognising on sight, because the model does not always announce it clearly. If a predictor perfectly splits presences from absences, the maximum likelihood estimate does not exist: the likelihood keeps increasing as the coefficient grows without bound, so the fitting algorithm never settles. ```{r} set.seed(11) nsep <- 40 depth <- sort(runif(nsep, 0, 5)) occ <- as.integer(depth > 2.5) # depth predicts occurrence perfectly msep <- glm(occ ~ depth, family = binomial) ``` The warning is the first clue. The coefficient is the second. ```{r} round(summary(msep)$coefficients["depth", c("Estimate", "Std. Error")], 1) range(fitted(msep)) ``` The slope has run off to roughly 344 with a standard error in the tens of thousands, and the fitted probabilities are pinned at exactly 0 and 1. These are the fingerprints of complete separation. Albert and Anderson (1984) classified the patterns precisely: with overlap the usual estimates exist, but under complete or quasi-complete separation the finite maximum likelihood estimate does not, and the standard errors are meaningless. The number is not a strong effect; it is a non-answer. The remedy is not to trust the output. Penalised likelihood (Firth's method, in the `logistf` package) keeps the estimates finite and is the standard fix. Often the more useful response is to ask why a predictor separates the data perfectly: with a small sample it can happen by chance, and with a strong predictor it can signal that the question is really about a threshold rather than a smooth gradient. ```{r} #| label: fig-sep #| fig-cap: "Complete separation: depth predicts occurrence perfectly, so the fitted curve collapses to a vertical step and the coefficient diverges." #| fig-alt: "Occurrence points are all zero below a depth of 2.5 and all one above it; the fitted red curve is a vertical step at that threshold rather than a smooth S-curve." #| fig-width: 8 #| fig-height: 4.4 #| message: false #| warning: false xx3 <- data.frame(depth = seq(0, 5, length = 400)) xx3$fit <- predict(msep, xx3, type = "response") ggplot() + geom_line(data = xx3, aes(depth, fit), colour = accent, linewidth = 1.1) + geom_point(data = data.frame(depth, occ), aes(depth, occ), alpha = .6, colour = ink, size = 1.8) + geom_vline(xintercept = 2.5, linetype = "dashed", colour = "#5d6b61") + labs(x = "Depth", y = "Occurrence probability", title = "Complete separation breaks the fit") + te_theme ``` ## Where this fits The binomial GLM is the binary sibling of the [count GLM](../glm-count-data-abundance/): same iteratively reweighted least squares underneath, different link and distribution chosen to match the response. It is also the modelling end of the spatial-data workflow. The [terra raster post](../terra-raster-basics/) extracts environmental values at species occurrence points; feed those values and a presence-absence column into a binomial GLM and you have a simple species distribution model. When the binary or proportion data are grouped, with repeated visits to the same sites, the independence assumption breaks and the model needs a random effect, exactly as in the [GLMM post](../glmm-nested-counts-pseudoreplication/). ## References Albert & Anderson 1984 Biometrika 71(1):1-10 (10.1093/biomet/71.1.1) Guisan & Zimmermann 2000 Ecological Modelling 135(2-3):147-186 (10.1016/S0304-3800(00)00354-9) McCullagh & Nelder 1989 Generalized Linear Models, 2nd edition, Chapman and Hall (ISBN 0-412-31760-5) Nelder & Wedderburn 1972 Journal of the Royal Statistical Society A 135(3):370-384 (10.2307/2344614) Warton & Hui 2011 Ecology 92(1):3-10 (10.1890/10-0340.1)