From QGIS to R: a spatial join and a clean GeoPackage handoff

R
QGIS
sf
spatial
GeoPackage
Load occurrence points in QGIS, tag each one with the habitat patch it falls in using a spatial join, export to GeoPackage, and pick the analysis up in R with sf.
Author

Tidy Ecology

Published

2026-06-23

A lot of ecological analysis starts in a GUI and finishes in code. You load layers in QGIS, look at them, fix the obvious problems, and then move to R for the statistics. This post walks one small and very common version of that handoff. You have species occurrence points and a set of habitat polygons, and you want every point labelled with the patch it sits inside. QGIS does the looking and the point-in-polygon assignment; R reads the result and summarises it.

The honest framing first: the assignment itself is one line of sf. So this is not a case where QGIS does something R cannot. What QGIS gives you here is a place to see the data before you trust it, and a searchable catalogue of spatial operations you do not have to remember by name. The post shows both routes and says where each one earns its place.

The screenshots are from QGIS 3.44 LTR. The menu paths are the same in QGIS 4.0, so nothing below goes stale if you are on the newer release.

The sample data

Everything here runs on synthetic data so you can reproduce it without any field files. The chunk below builds 110 occurrence points scattered around a single coordinate, plus five habitat polygons. Some points fall outside every polygon on purpose, because a real join almost always leaves a few records unmatched, and handling those is part of the job. It also writes the two files you will open in QGIS: occurrences.csv and habitat_patches.gpkg. Note that the CSV has no habitat column. That column is exactly what the join is about to produce.

library(sf)
set.seed(46772359)

# five habitat polygons (lon/lat, EPSG:4326)
mk <- function(m) st_polygon(list(rbind(m, m[1, , drop = FALSE])))
patches <- st_sf(
  patch_id = c("P1", "P2", "P3", "P4", "P5"),
  habitat  = c("forest", "grassland", "wetland", "scrub", "arable"),
  geometry = st_sfc(
    mk(matrix(c(23.520,46.740, 23.570,46.740, 23.570,46.780, 23.520,46.780), ncol = 2, byrow = TRUE)),
    mk(matrix(c(23.580,46.750, 23.630,46.750, 23.630,46.790, 23.580,46.790), ncol = 2, byrow = TRUE)),
    mk(matrix(c(23.550,46.790, 23.600,46.790, 23.600,46.815, 23.550,46.815), ncol = 2, byrow = TRUE)),
    mk(matrix(c(23.635,46.745, 23.685,46.745, 23.685,46.785, 23.635,46.785), ncol = 2, byrow = TRUE)),
    mk(matrix(c(23.515,46.785, 23.550,46.785, 23.550,46.815, 23.515,46.815), ncol = 2, byrow = TRUE)),
    crs = 4326))

# occurrence points: oversample, then keep a controlled mix of inside and outside
cand <- st_as_sf(
  data.frame(lon = runif(2000, 23.50, 23.70), lat = runif(2000, 46.72, 46.82)),
  coords = c("lon", "lat"), crs = 4326, remove = FALSE)
membership <- st_join(cand, patches["habitat"], join = st_intersects)$habitat
inside  <- which(!is.na(membership))
outside <- which(is.na(membership))
keep <- sample(c(sample(inside, 85), sample(outside, 25)))
sel  <- cand[keep, ]
xy   <- st_coordinates(sel)

# species, with a mild habitat preference so there is a signal to find later
inside_hab <- membership[keep]
fav <- list(forest = c("sp1","sp2"), grassland = c("sp3","sp4"), wetland = c("sp5","sp6"),
            scrub = c("sp7","sp8"), arable = c("sp4","sp8"))
spp <- paste0("sp", 1:8)
species <- vapply(seq_along(inside_hab), function(i) {
  if (is.na(inside_hab[i])) return(sample(spp, 1))
  w <- setNames(rep(1, 8), spp); w[fav[[inside_hab[i]]]] <- 4
  sample(spp, 1, prob = w)
}, character(1))

occ <- data.frame(
  site_id   = sprintf("S%03d", seq_along(keep)),
  lon       = round(xy[, 1], 6),
  lat       = round(xy[, 2], 6),
  species   = species,
  abundance = rpois(length(keep), 6) + 1L
)

write.csv(occ, "occurrences.csv", row.names = FALSE)
st_write(patches, "habitat_patches.gpkg", delete_dsn = TRUE, quiet = TRUE)

knitr::kable(head(occ), caption = "First rows of the synthetic occurrence table.")
First rows of the synthetic occurrence table.
site_id lon lat species abundance
S001 23.62643 46.77211 sp4 7
S002 23.63556 46.75372 sp8 6
S003 23.65438 46.79975 sp3 7
S004 23.57296 46.80512 sp3 6
S005 23.60224 46.79576 sp6 8
S006 23.52232 46.74981 sp5 2

In QGIS

1. Load the points. Layer then Add Layer then Add Delimited Text Layer. Pick occurrences.csv, set the format to CSV, choose point geometry with X field = lon and Y field = lat, and set the geometry CRS to EPSG:4326 - WGS 84. The dialog shows you a preview of the table before you commit, which is a good moment to confirm the coordinate columns were read as numbers and not text.

The Add Delimited Text Layer dialog, with lon and lat mapped to the X and Y fields and the geometry CRS set to EPSG:4326.

2. Add the polygons and look. Drag habitat_patches.gpkg onto the canvas, or load it through Add Vector Layer. Put the point layer above the polygons in the Layers panel. Now you can see the thing you are about to compute: most points land inside a patch, and a handful sit outside every one of them. That second group is the reason the join needs a deliberate decision later.

The QGIS canvas showing occurrence points over five habitat polygons. Several points fall outside any polygon.

3. Move to a projected CRS. The points came in as longitude and latitude, which is fine for the join but awkward for anything measured in metres. Open Project then Properties then CRS (or click the CRS button at the bottom right of the window), filter for 3844, and choose EPSG:3844 - Pulkovo 1942(58) / Stereo70, the national projected system for Romania. The map reprojects on the fly. A spatial join by intersection gives the same answer in either CRS, but working in metres keeps distances and areas honest once you start measuring things.

The CRS selector filtered to 3844, with EPSG:3844 Stereo70 selected and its area of use shown.

4. Run the spatial join. Open Processing then Toolbox, and search for Join attributes by location. Set the base layer to occurrences and the join layer to habitat_patches. Use the intersects predicate, and a join type of take attributes of the first matching feature only, since each point sits in at most one patch. Leave Discard records which could not be joined unchecked. That box is the whole decision: unchecked, the 25 outside points stay in the output with an empty habitat, which is what you want so you can account for them rather than lose them silently.

The Join attributes by location dialog, set to an intersects predicate, one-to-one join, with the discard option left unchecked.

5. Export to GeoPackage. Right-click the joined layer, then Export then Save Features As. Choose GeoPackage as the format, name the file occurrences_joined.gpkg, and set the output CRS to EPSG:3844. GeoPackage is the clean way to hand vector data to R: one file, an open OGC standard, and none of the four-file shapefile baggage. That single file is the entire handoff.

Back in R

QGIS wrote occurrences_joined.gpkg, and you can read it straight back:

library(sf)
occ_tagged <- st_read("occurrences_joined.gpkg")

To keep this post reproducible without anyone repeating the clicks, here is the same point-in-polygon assignment as the single sf call it really is. This is what the figures below use, and it returns the same labels QGIS produced:

occ_sf <- st_as_sf(occ, coords = c("lon", "lat"), crs = 4326)
occ_tagged <- st_join(occ_sf, patches["habitat"], join = st_intersects)
occ_tagged$habitat[is.na(occ_tagged$habitat)] <- "unassigned"

table(occ_tagged$habitat)

    arable     forest  grassland      scrub unassigned    wetland 
        11         19         27         15         25         13 

The 25 points outside every patch now carry the label unassigned instead of a missing value, which keeps them visible in summaries. Here is the joined result drawn in R, projected to Stereo70:

library(ggplot2)
hab_levels <- c("forest", "grassland", "wetland", "scrub", "arable", "unassigned")
hab_cols <- c(forest = "#2f6b3c", grassland = "#8aa84b", wetland = "#2f6f93",
              scrub = "#b06d35", arable = "#caa12e", unassigned = "#b8b3a7")

occ_3844     <- st_transform(occ_tagged, 3844)
patches_3844 <- st_transform(patches, 3844)
occ_3844$habitat     <- factor(occ_3844$habitat, levels = hab_levels)
patches_3844$habitat <- factor(patches_3844$habitat, levels = hab_levels)

ggplot() +
  geom_sf(data = patches_3844, aes(fill = habitat), colour = "grey35", linewidth = 0.3, alpha = 0.18) +
  geom_sf(data = occ_3844, aes(colour = habitat), size = 2) +
  scale_fill_manual(values = hab_cols, drop = FALSE, guide = "none") +
  scale_colour_manual(values = hab_cols, drop = FALSE, name = "habitat") +
  labs(x = NULL, y = NULL) +
  theme_minimal(base_size = 13) +
  theme(panel.grid = element_line(colour = "grey92"))
A map of occurrence points over five habitat polygons. Points are coloured by habitat; points outside any polygon are grey.
Figure 1: The joined points, coloured by the habitat patch each one falls in. Grey points lie outside every patch.

With a habitat column attached, the kind of summary you came for is now ordinary data work. Counting species richness and total abundance per habitat takes a single grouped summary:

library(dplyr)
per_habitat <- occ_tagged |>
  st_drop_geometry() |>
  filter(habitat != "unassigned") |>
  group_by(habitat) |>
  summarise(richness = n_distinct(species),
            total_abundance = sum(abundance),
            records = n(),
            .groups = "drop")

knitr::kable(per_habitat, caption = "Richness and abundance per habitat, after the join.")
Richness and abundance per habitat, after the join.
habitat richness total_abundance records
arable 5 75 11
forest 8 108 19
grassland 7 179 27
scrub 7 106 15
wetland 7 96 13
library(tidyr)
per_habitat |>
  pivot_longer(c(richness, total_abundance), names_to = "metric", values_to = "value") |>
  mutate(metric = factor(metric, c("richness", "total_abundance"),
                         c("species richness", "total abundance")),
         habitat = factor(habitat, hab_levels)) |>
  ggplot(aes(habitat, value, fill = habitat)) +
  geom_col(width = 0.7, colour = "grey30", linewidth = 0.25) +
  facet_wrap(~ metric, scales = "free_y") +
  scale_fill_manual(values = hab_cols, drop = FALSE, guide = "none") +
  labs(x = NULL, y = NULL) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.x = element_blank(),
        axis.text.x = element_text(angle = 30, hjust = 1),
        strip.text = element_text(face = "bold"))
Two bar panels. The left panel shows species richness per habitat; the right panel shows total abundance per habitat.
Figure 2: Species richness and total abundance per habitat for the matched points.

Which tool, when

A spatial join is a fair test of where the line sits. In sf it is one call, it is scripted, and it drops straight into a pipeline, so for anything you want to rerun or version it belongs in R. QGIS earns its place just before that: it is where you confirm the points landed where they should, where you catch the ones that fell in the sea because a CRS was set wrong, and where you reach for an operation by browsing the toolbox instead of recalling its name. The GeoPackage is what keeps the two honest with each other, a single standardised file passed from the place you looked at the data to the place you analyse it.

References

Pebesma, E. (2018). Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal, 10(1), 439-446. https://doi.org/10.32614/RJ-2018-009