library(survival)
library(ggplot2)
library(dplyr)Cox proportional hazards and the PH assumption
The Kaplan-Meier estimator compares survival between a few groups but cannot adjust for a continuous covariate or hold several predictors constant at once. The Cox proportional hazards model does both. It relates the hazard, the instantaneous risk of the event, to covariates without committing to a shape for the baseline hazard, which makes it the workhorse of survival analysis in ecology and everywhere else. The price of that flexibility is one assumption: each covariate multiplies the hazard by a constant factor that does not change over time. This post fits a Cox model in base R with the recommended survival package, interprets the hazard ratios, and then spends most of its length on the assumption, because a hazard ratio quoted from a model that violates it is a single number standing in for an effect that actually reverses.
Survival data with covariates
The example follows 220 tagged individuals to death or censoring, each carrying two covariates: a standardised parasite load, and a life stage, juvenile or adult. Parasite load is built to act proportionally, raising the hazard by a fixed multiple. Stage is built to violate that rule: juveniles start at high risk that eases as they grow, adults start safe and become frailer, so the two hazards cross rather than staying in fixed ratio.
set.seed(414)
n <- 220
parasite <- round(rnorm(n), 3) # standardised parasite load
stage <- factor(sample(c("Juvenile", "Adult"), n, replace = TRUE))
shape <- ifelse(stage == "Juvenile", 0.85, 1.75) # different shape => non-proportional
scale <- 26 * exp(-0.45 * parasite) # higher load => shorter life (proportional)
tt <- rweibull(n, shape = shape, scale = scale)
ct <- pmin(rexp(n, 1 / 70), 42) # loss to follow-up plus admin end
time <- pmin(tt, ct); status <- as.integer(tt <= ct)
d <- data.frame(time = round(time, 2), status = status, parasite = parasite, stage = stage)
c(n = n, events = sum(d$status), censored_fraction = round(mean(d$status == 0), 3)) n events censored_fraction
220.000 140.000 0.364
Fitting the model and reading hazard ratios
A Cox model reports one coefficient per covariate on the log-hazard scale. Exponentiating gives a hazard ratio: the factor by which the hazard is multiplied for a one-unit rise in the covariate.
m <- coxph(Surv(time, status) ~ parasite + stage, data = d)
summary(m)$coefficients coef exp(coef) se(coef) z Pr(>|z|)
parasite 0.5849367 1.7948774 0.09022633 6.482993 8.992045e-11
stageJuvenile -0.3505588 0.7042944 0.17368420 -2.018369 4.355283e-02
round(exp(confint(m)), 3) 2.5 % 97.5 %
parasite 1.504 2.142
stageJuvenile 0.501 0.990
Each standard deviation of parasite load multiplies the hazard by about 1.79, a 79 per cent rise in risk, with a confidence interval from roughly 1.50 to 2.14 that sits well clear of one. The stage coefficient reports a hazard ratio near 0.70 for juveniles against adults, seemingly a 30 per cent lower risk. The concordance of the model is about 0.641, modestly better than a coin toss at ordering who dies first. The parasite effect is trustworthy. The stage hazard ratio is not, and the reason is the assumption behind it.
Testing the proportional-hazards assumption
If a covariate multiplies the hazard by a constant, the Schoenfeld residuals for that covariate, the gap at each death between the covariate value of the individual who died and the average among those still at risk, should show no trend against time. A slope in those residuals means the effect is changing. The cox.zph function tests each covariate and the model as a whole.
z <- cox.zph(m)
z chisq df p
parasite 1.36 1 0.24
stage 17.77 1 2.5e-05
GLOBAL 21.33 2 2.3e-05
Parasite load passes comfortably, with a p-value near 0.24: its effect is steady over the follow-up. Stage fails hard, with a p-value around 0.00002 and a global test that fails with it. The single hazard ratio of 0.70 is therefore an average of an effect that does not hold still. Plotting the scaled Schoenfeld residuals for stage against time shows what the test detects.
sr <- data.frame(time = z$time, resid = z$y[, "stage"])
beta_stage <- coef(m)["stageJuvenile"]
ggplot(sr, aes(time, resid)) +
geom_hline(yintercept = beta_stage, colour = te_faint, linetype = "42", linewidth = 0.5) +
geom_point(colour = te_sage, size = 1.1, alpha = 0.8) +
geom_smooth(method = "loess", formula = y ~ x, se = TRUE,
colour = te_brick, fill = te_brick, alpha = 0.15, linewidth = 0.9) +
annotate("text", x = max(sr$time) * 0.98, y = beta_stage - 0.55,
label = "constant-effect reference", colour = te_faint, hjust = 1, size = 3.0) +
labs(title = "The stage effect is not proportional",
subtitle = "Scaled Schoenfeld residuals trend with time instead of staying flat",
x = "Months since tagging", y = "Scaled Schoenfeld residual (stage)") +
theme_te()
The residuals drift from positive to negative across the follow-up: early on the juvenile hazard sits above the adult one, later it sits below. A single ratio cannot describe an effect that reverses, and reporting one would mislead about both the size and the direction of the stage difference at any particular age.
What the crossing looks like, and how to handle it
The mechanism is visible in the raw survival curves. Where proportional hazards hold, two Kaplan-Meier curves never cross; here they do, because the stage hazards swap order partway through.
kfit <- survfit(Surv(time, status) ~ stage, data = d)
grid <- seq(0, 42, by = 0.25)
kd <- lapply(levels(d$stage), function(g) {
s <- summary(kfit[paste0("stage=", g)])
data.frame(stage = g, time = grid, surv = stepfun(s$time, c(1, s$surv))(grid))
}) |> bind_rows()
ggplot(kd, aes(time, surv, colour = stage)) +
geom_line(linewidth = 0.9) +
scale_colour_manual(values = c(Adult = te_forest, Juvenile = te_gold)) +
annotate("text", x = 40, y = 0.30, label = "Adult", colour = te_forest, hjust = 1, size = 3.2) +
annotate("text", x = 40, y = 0.05, label = "Juvenile", colour = "#8a7a30", hjust = 1, size = 3.2) +
coord_cartesian(ylim = c(0, 1)) +
labs(title = "Crossing curves break proportional hazards",
subtitle = "Juveniles die faster early, adults faster late",
x = "Months since tagging", y = "Survival probability") +
theme_te()
The standard fix is to stratify by the offending covariate. Stratification lets each stage keep its own unspecified baseline hazard, so the model no longer forces a constant stage ratio, and estimates the parasite effect within strata.
m2 <- coxph(Surv(time, status) ~ parasite + strata(stage), data = d)
c(parasite_HR = round(exp(coef(m2)["parasite"]), 3),
lower = round(exp(confint(m2)["parasite", 1]), 3),
upper = round(exp(confint(m2)["parasite", 2]), 3))parasite_HR.parasite lower upper
1.892 1.575 2.272
cox.zph(m2)$table["GLOBAL", ] chisq df p
3.23277738 1.00000000 0.07217818
Inside the strata the parasite hazard ratio is about 1.89, close to its earlier value and now free of contamination from the mis-modelled stage effect, and the assumption test on the stratified model no longer fails, with a global p-value near 0.07. Stage is no longer reported as a hazard ratio at all, which is honest: its effect was never a single number. When the changing effect itself is of interest, a time-by-covariate interaction estimates how the hazard ratio moves rather than hiding it.
Where to go next
The Cox model extends the survival analysis begun with Kaplan-Meier curves to covariates, and its assumption check is one instance of the wider habit of testing what a model takes for granted, the same discipline as residual diagnostics for a generalised linear model. When survival must be estimated from repeated captures instead of known fates, the Cormack-Jolly-Seber model plays the Cox role for detection-limited data, and when the response is whether a nest survives a monitoring interval, the logistic-exposure model carries covariates into daily survival.
References
Cox DR 1972. Journal of the Royal Statistical Society: Series B 34(2):187-202 (10.1111/j.2517-6161.1972.tb00899.x).
Schoenfeld D 1982. Biometrika 69(1):239-241 (10.1093/biomet/69.1.239).
Grambsch PM, Therneau TM 1994. Biometrika 81(3):515-526 (10.1093/biomet/81.3.515).
Therneau TM, Grambsch PM 2000. Modeling Survival Data: Extending the Cox Model. Springer. ISBN 978-0-387-98784-2.