Contrasts and post-hoc comparisons in R

R
linear models
ANOVA
multiple comparisons
ecology tutorial
After a significant ANOVA, which groups differ? Treatment versus sum coding, planned and trend contrasts, Tukey HSD, and correcting pairwise tests.
Author

Tidy Ecology

Published

2026-06-26

A one-way ANOVA tells you that some group means differ, and nothing more. The previous post showed that the test is just a linear model with a factor, and that a significant F leaves two questions open. First, the coefficients you get depend on how the factor is coded, so you need to choose a coding that answers the comparison you care about. Second, once you start asking which specific groups differ, you are running several tests at once, and the chance of a false positive grows with each one. This post covers both: contrast coding, planned and trend contrasts, and post-hoc pairwise comparisons with a multiplicity correction.

The data is a nitrogen addition experiment: aboveground biomass measured on plots at four nitrogen levels, fifteen plots each. Fertilisation raises biomass, which is one of the better-documented effects in grassland ecology.

set.seed(116)
ng  <- 15
lev <- c("Control", "Low", "Medium", "High")
nit <- data.frame(
  nitrogen = factor(rep(lev, each = ng), levels = lev),
  biomass  = c(rnorm(ng, 250, 50), rnorm(ng, 320, 50),
               rnorm(ng, 400, 50), rnorm(ng, 430, 50))
)

m <- lm(biomass ~ nitrogen, data = nit)
anova(m)
Analysis of Variance Table

Response: biomass
          Df Sum Sq Mean Sq F value    Pr(>F)    
nitrogen   3 259374   86458  34.416 9.614e-13 ***
Residuals 56 140680    2512                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The omnibus F is 34.4 on 3 and 56 degrees of freedom, with a p-value near 9.6e-13: the four means are not all equal, and nitrogen explains about 65 percent of the variation in biomass. Group means run from 268 g/m^2 in the control to 439 g/m^2 at high nitrogen, with a grand mean of 350.

ggplot(nit, aes(nitrogen, biomass, colour = nitrogen)) +
  geom_jitter(width = 0.12, height = 0, size = 2, alpha = 0.55) +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.16, linewidth = 0.7) +
  stat_summary(fun = mean, geom = "point", size = 4, shape = 18) +
  scale_colour_manual(values = n_cols) +
  labs(x = "Nitrogen addition", y = expression(Aboveground~biomass~(g/m^2))) +
  base_theme

Jittered biomass for control, low, medium and high nitrogen, rising from about 270 to 440 grams per square metre. Control and low overlap; medium and high are clearly higher.

Biomass by nitrogen level. Points are plots, diamonds are means, bars are one standard error.

Two ways to code the same factor

The coefficients from lm are not the group means; they are contrasts, and which contrasts you see depends on the coding scheme. R uses treatment coding by default, which we met in the previous post: the first level becomes a reference folded into the intercept, and the other coefficients are differences from it.

summary(m)$coefficients
                Estimate Std. Error   t value     Pr(>|t|)
(Intercept)    268.32183   12.94128 20.733801 5.993993e-28
nitrogenLow     40.74862   18.30173  2.226490 3.002132e-02
nitrogenMedium 113.29365   18.30173  6.190326 7.438595e-08
nitrogenHigh   170.82045   18.30173  9.333570 5.236574e-13

The intercept, 268, is the control mean. Each other coefficient is the rise above the control: low adds 41 g/m^2, medium adds 113, high adds 171. This is the right coding when one level is a natural baseline, which here it is.

Sum-to-zero coding answers a different question. It makes the intercept the grand mean and each coefficient a deviation from it, with the last level dropped because the deviations sum to zero.

m_sum <- lm(biomass ~ nitrogen, data = nit, contrasts = list(nitrogen = contr.sum))
summary(m_sum)$coefficients
             Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 349.53751   6.470638 54.019021 5.313564e-50
nitrogen1   -81.21568  11.207473 -7.246565 1.354144e-09
nitrogen2   -40.46706  11.207473 -3.610721 6.531004e-04
nitrogen3    32.07797  11.207473  2.862195 5.906976e-03

Now the intercept is 350, the grand mean, and its standard error (6.5) is smaller than the treatment intercept’s (12.9), because the grand mean pools all sixty plots while the control mean used only fifteen. The three coefficients are the deviations of control, low and medium from the grand mean (-81, -40, +32); the high deviation is not printed, but it is the negative of their sum, +90, which is exactly high minus the grand mean. The two models describe the same fit: their fitted values are identical to machine precision. Coding changes what the parameters mean, not what the model predicts. Sum coding is the convention behind classical ANOVA tables and is the sensible choice when no level is special, for instance when you later add an interaction and want main effects that are not tied to one reference cell.

Planned contrasts

A contrast is a weighted combination of group means with weights that sum to zero, and it lets you test a comparison decided in advance rather than fishing through all pairs. Two kinds are worth knowing.

When the factor is ordered, polynomial contrasts split the effect into a linear trend, a quadratic (curvature), a cubic, and so on. Declaring nitrogen ordered makes R fit these automatically.

nit$nit_ord <- factor(nit$nitrogen, levels = lev, ordered = TRUE)
summary(lm(biomass ~ nit_ord, data = nit))$coefficients
             Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 349.53751   6.470638 54.0190210 5.313564e-50
nit_ord.L   130.81140  12.941275 10.1080766 3.092667e-14
nit_ord.Q     8.38909  12.941275  0.6482429 5.194772e-01
nit_ord.C   -10.46807  12.941275 -0.8088903 4.220016e-01

The linear term is large and highly significant (t = 10.1), while the quadratic (p = 0.52) and cubic (p = 0.42) are not: biomass climbs with nitrogen along an essentially straight line over this range, with no detectable bend. A single linear-trend test like this is far more focused, and more powerful, than an omnibus F when a monotone dose response is what you expect.

Any other planned comparison is a contrast vector applied to the group means. Suppose the question is whether adding nitrogen at all changes biomass: control against the average of the three fertilised levels. The weights are c(1, -1/3, -1/3, -1/3), and the estimate, its standard error, and a t test come straight from the group means and the model’s residual variance.

gm     <- tapply(nit$biomass, nit$nitrogen, mean)
Lvec   <- c(1, -1/3, -1/3, -1/3)
sigma2 <- summary(m)$sigma^2
est    <- sum(Lvec * gm)
se     <- sqrt(sigma2 * sum(Lvec^2 / ng))
c(estimate = est, se = se, t = est / se,
  p = 2 * pt(-abs(est / se), df = m$df.residual))
     estimate            se             t             p 
-1.082876e+02  1.494330e+01 -7.246565e+00  1.354144e-09 

Control biomass sits 108 g/m^2 below the fertilised average (t = -7.2, p near 1e-09). Because this is one comparison chosen before seeing the data, it needs no correction: the multiplicity problem only arises when you test many comparisons at once, which is what post-hoc analysis does.

Post-hoc: which pairs differ?

With no prior hypothesis you may want every pairwise comparison. For four groups that is six tests, and testing six things at the 0.05 level means the chance of at least one false positive is well above 0.05. Tukey’s honest significant difference is built for exactly this case: it compares all pairs while holding the family-wise error rate at the stated level, using the distribution of the largest difference among the means.

TukeyHSD(aov(biomass ~ nitrogen, data = nit))$nitrogen
                    diff        lwr       upr        p adj
Low-Control     40.74862  -7.712292  89.20953 1.286100e-01
Medium-Control 113.29365  64.832739 161.75456 4.408819e-07
High-Control   170.82045 122.359537 219.28136 1.047651e-11
Medium-Low      72.54503  24.084119 121.00594 1.179911e-03
High-Low       130.07183  81.610918 178.53274 1.373005e-08
High-Medium     57.52680   9.065887 105.98771 1.382221e-02
td <- as.data.frame(TukeyHSD(aov(biomass ~ nitrogen, data = nit))$nitrogen)
td$pair <- rownames(td)
td$sig  <- factor(td$`p adj` < 0.05, levels = c(TRUE, FALSE))

ggplot(td, aes(diff, reorder(pair, diff), colour = sig)) +
  geom_vline(xintercept = 0, linetype = "dashed", colour = faint, linewidth = 0.5) +
  geom_errorbarh(aes(xmin = lwr, xmax = upr), height = 0.22, linewidth = 0.8) +
  geom_point(size = 3.2) +
  scale_colour_manual(values = c("TRUE" = forest, "FALSE" = "#c4ad5e")) +
  labs(x = expression(Difference~"in"~mean~biomass~(g/m^2)), y = NULL) +
  base_theme
Warning: `geom_errorbarh()` was deprecated in ggplot2 4.0.0.
ℹ Please use the `orientation` argument of `geom_errorbar()` instead.
`height` was translated to `width`.

Forest plot of six pairwise biomass differences with confidence intervals. Five intervals lie entirely above zero and are green; the low-versus-control interval crosses zero and is gold.

Tukey HSD pairwise differences with family-wise 95 percent intervals. Green intervals exclude zero; the gold one (low versus control) does not.

Five of the six pairs differ. The exception is low versus control: the difference is 41 g/m^2 but the family-wise interval runs from -8 to 89 and includes zero, so the lowest dose is not separable from the control. Everything from medium upward is.

pairwise.t.test runs the same comparisons and lets you choose the correction explicitly. The choice matters most at the boundary.

sapply(c("none", "bonferroni", "holm", "BH"),
       function(a) pairwise.t.test(nit$biomass, nit$nitrogen,
                                   p.adjust.method = a)$p.value["Low", "Control"])
      none bonferroni       holm         BH 
0.03002132 0.18012795 0.03002132 0.03002132 

Low versus control is the borderline case. Unadjusted it is significant (0.03); Holm and Benjamini-Hochberg keep it significant (0.03), Bonferroni does not (0.18), and Tukey called it non-significant too. Four valid procedures, two verdicts. The honest reading is that the lowest dose is right at the edge of detectability, and the label depends on the method, so report the estimate and its interval rather than leaning on a single star.

Why the correction is not optional

To see what unadjusted pairwise tests cost, simulate the situation where there is genuinely no effect: four groups drawn from one distribution, all six pairwise comparisons run, repeated two thousand times. The family-wise error rate is the fraction of datasets where at least one comparison comes out significant.

set.seed(909)
R <- 2000; ng2 <- 12; k <- 4
g  <- factor(rep(seq_len(k), each = ng2))
sims <- replicate(R, {
  y    <- rnorm(ng2 * k)
  praw <- pairwise.t.test(y, g, p.adjust.method = "none")$p.value
  praw <- praw[!is.na(praw)]
  c(none = min(praw),
    bonferroni = min(p.adjust(praw, "bonferroni")),
    holm = min(p.adjust(praw, "holm")))
})
fwer <- rowMeans(sims < 0.05)
round(fwer, 4)
      none bonferroni       holm 
    0.1935     0.0380     0.0380 
fd <- data.frame(method = factor(names(fwer), levels = names(fwer)),
                 rate = as.numeric(fwer))
levels(fd$method) <- c("Unadjusted", "Bonferroni", "Holm")

ggplot(fd, aes(method, rate, fill = method)) +
  geom_col(width = 0.6) +
  geom_hline(yintercept = 0.05, linetype = "dashed", colour = "#b5534e", linewidth = 0.6) +
  geom_text(aes(label = sprintf("%.1f%%", 100 * rate)), vjust = -0.5, colour = body, size = 4) +
  scale_fill_manual(values = c(Unadjusted = "#b5534e", Bonferroni = "#7a9a4e", Holm = "#275139")) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.12)),
                     labels = function(x) paste0(x * 100, "%")) +
  labs(x = NULL, y = "Family-wise false-positive rate") +
  base_theme

Bar chart of family-wise false-positive rates: unadjusted near 19 percent, well above a dashed line at 5 percent; Bonferroni and Holm both near 4 percent, below the line.

Family-wise false-positive rate under the global null, by correction. The dashed line marks the nominal 5 percent.

Without a correction, about one run in five flags a difference that is not there: a 19 percent error rate where 5 percent was intended. Bonferroni and Holm both pull it back below the nominal level (to about 0.04). They give the same family-wise protection here, and that is not a coincidence: family-wise error is controlled by the smallest p-value, and both methods multiply the smallest p by the same factor. Where Holm wins is everywhere else. It penalises the larger p-values less, which is why it kept low-versus-control significant on the real data while Bonferroni did not. Use Holm over Bonferroni by default; use Benjamini-Hochberg when you can tolerate a controlled false discovery rate in exchange for more power across many comparisons.

In day-to-day work the emmeans package (Lenth) does all of this in one place: estimated marginal means, any contrast you write down, and the matching adjustment, including for models with covariates and interactions where hand computation gets awkward. Everything above is base R so the machinery is visible, but emmeans is the tool to reach for once you understand what it is doing.

The same logic carries to the multivariate setting. When the response is a whole community rather than one number, the omnibus test is PERMANOVA and the follow-up is pairwise PERMANOVA with the identical multiplicity problem and the identical fixes, which the pairwise PERMANOVA post works through.

References

Tukey 1949 Biometrics 5(2):99-114 (10.2307/3001913).

Dunn 1961 Journal of the American Statistical Association 56(293):52-64 (10.1080/01621459.1961.10482090).

Holm 1979 Scandinavian Journal of Statistics 6(2):65-70 (10.2307/4615733).

Benjamini and Hochberg 1995 Journal of the Royal Statistical Society Series B 57(1):289-300 (10.1111/j.2517-6161.1995.tb02031.x).