From QGIS to R: a spatial join and a clean GeoPackage handoff

QGIS

spatial

GeoPackage

Load occurrence points in QGIS, tag each one with the habitat patch it falls in using a spatial join, export to GeoPackage, and pick the analysis up in R with sf.

Author

Tidy Ecology

Published

2026-06-23

A lot of ecological analysis starts in a GUI and finishes in code. You load layers in QGIS, look at them, fix the obvious problems, and then move to R for the statistics. This post walks one small and very common version of that handoff. You have species occurrence points and a set of habitat polygons, and you want every point labelled with the patch it sits inside. QGIS does the looking and the point-in-polygon assignment; R reads the result and summarises it.

The honest framing first: the assignment itself is one line of sf. So this is not a case where QGIS does something R cannot. What QGIS gives you here is a place to see the data before you trust it, and a searchable catalogue of spatial operations you do not have to remember by name. The post shows both routes and says where each one earns its place.

The screenshots are from QGIS 3.44 LTR. The menu paths are the same in QGIS 4.0, so nothing below goes stale if you are on the newer release.

The sample data

Everything here runs on synthetic data so you can reproduce it without any field files. The chunk below builds 110 occurrence points scattered around a single coordinate, plus five habitat polygons. Some points fall outside every polygon on purpose, because a real join almost always leaves a few records unmatched, and handling those is part of the job. It also writes the two files you will open in QGIS: occurrences.csv and habitat_patches.gpkg. Note that the CSV has no habitat column. That column is exactly what the join is about to produce.

library(sf)
set.seed(46772359)

# five habitat polygons (lon/lat, EPSG:4326)
mk <- function(m) st_polygon(list(rbind(m, m[1, , drop = FALSE])))
patches <- st_sf(
  patch_id = c("P1", "P2", "P3", "P4", "P5"),
  habitat  = c("forest", "grassland", "wetland", "scrub", "arable"),
  geometry = st_sfc(
    mk(matrix(c(23.520,46.740, 23.570,46.740, 23.570,46.780, 23.520,46.780), ncol = 2, byrow = TRUE)),
    mk(matrix(c(23.580,46.750, 23.630,46.750, 23.630,46.790, 23.580,46.790), ncol = 2, byrow = TRUE)),
    mk(matrix(c(23.550,46.790, 23.600,46.790, 23.600,46.815, 23.550,46.815), ncol = 2, byrow = TRUE)),
    mk(matrix(c(23.635,46.745, 23.685,46.745, 23.685,46.785, 23.635,46.785), ncol = 2, byrow = TRUE)),
    mk(matrix(c(23.515,46.785, 23.550,46.785, 23.550,46.815, 23.515,46.815), ncol = 2, byrow = TRUE)),
    crs = 4326))

# occurrence points: oversample, then keep a controlled mix of inside and outside
cand <- st_as_sf(
  data.frame(lon = runif(2000, 23.50, 23.70), lat = runif(2000, 46.72, 46.82)),
  coords = c("lon", "lat"), crs = 4326, remove = FALSE)
membership <- st_join(cand, patches["habitat"], join = st_intersects)$habitat
inside  <- which(!is.na(membership))
outside <- which(is.na(membership))
keep <- sample(c(sample(inside, 85), sample(outside, 25)))
sel  <- cand[keep, ]
xy   <- st_coordinates(sel)

# species, with a mild habitat preference so there is a signal to find later
inside_hab <- membership[keep]
fav <- list(forest = c("sp1","sp2"), grassland = c("sp3","sp4"), wetland = c("sp5","sp6"),
            scrub = c("sp7","sp8"), arable = c("sp4","sp8"))
spp <- paste0("sp", 1:8)
species <- vapply(seq_along(inside_hab), function(i) {
  if (is.na(inside_hab[i])) return(sample(spp, 1))
  w <- setNames(rep(1, 8), spp); w[fav[[inside_hab[i]]]] <- 4
  sample(spp, 1, prob = w)
}, character(1))

occ <- data.frame(
  site_id   = sprintf("S%03d", seq_along(keep)),
  lon       = round(xy[, 1], 6),
  lat       = round(xy[, 2], 6),
  species   = species,
  abundance = rpois(length(keep), 6) + 1L
)

write.csv(occ, "occurrences.csv", row.names = FALSE)
st_write(patches, "habitat_patches.gpkg", delete_dsn = TRUE, quiet = TRUE)

knitr::kable(head(occ), caption = "First rows of the synthetic occurrence table.")

First rows of the synthetic occurrence table.
site_id	lon	lat	species	abundance
S001	23.62643	46.77211	sp4	7
S002	23.63556	46.75372	sp8	6
S003	23.65438	46.79975	sp3	7
S004	23.57296	46.80512	sp3	6
S005	23.60224	46.79576	sp6	8
S006	23.52232	46.74981	sp5	2

In QGIS

1. Load the points. Layer then Add Layer then Add Delimited Text Layer. Pick occurrences.csv, set the format to CSV, choose point geometry with X field = lon and Y field = lat, and set the geometry CRS to EPSG:4326 - WGS 84. The dialog shows you a preview of the table before you commit, which is a good moment to confirm the coordinate columns were read as numbers and not text.

The Add Delimited Text Layer dialog, with lon and lat mapped to the X and Y fields and the geometry CRS set to EPSG:4326.

2. Add the polygons and look. Drag habitat_patches.gpkg onto the canvas, or load it through Add Vector Layer. Put the point layer above the polygons in the Layers panel. Now you can see the thing you are about to compute: most points land inside a patch, and a handful sit outside every one of them. That second group is the reason the join needs a deliberate decision later.

The QGIS canvas showing occurrence points over five habitat polygons. Several points fall outside any polygon.

3. Move to a projected CRS. The points came in as longitude and latitude, which is fine for the join but awkward for anything measured in metres. Open Project then Properties then CRS (or click the CRS button at the bottom right of the window), filter for 3844, and choose EPSG:3844 - Pulkovo 1942(58) / Stereo70, the national projected system for Romania. The map reprojects on the fly. A spatial join by intersection gives the same answer in either CRS, but working in metres keeps distances and areas honest once you start measuring things.

The CRS selector filtered to 3844, with EPSG:3844 Stereo70 selected and its area of use shown.

4. Run the spatial join. Open Processing then Toolbox, and search for Join attributes by location. Set the base layer to occurrences and the join layer to habitat_patches. Use the intersects predicate, and a join type of take attributes of the first matching feature only, since each point sits in at most one patch. Leave Discard records which could not be joined unchecked. That box is the whole decision: unchecked, the 25 outside points stay in the output with an empty habitat, which is what you want so you can account for them rather than lose them silently.

The Join attributes by location dialog, set to an intersects predicate, one-to-one join, with the discard option left unchecked.

5. Export to GeoPackage. Right-click the joined layer, then Export then Save Features As. Choose GeoPackage as the format, name the file occurrences_joined.gpkg, and set the output CRS to EPSG:3844. GeoPackage is the clean way to hand vector data to R: one file, an open OGC standard, and none of the four-file shapefile baggage. That single file is the entire handoff.

Back in R

QGIS wrote occurrences_joined.gpkg, and you can read it straight back:

library(sf)
occ_tagged <- st_read("occurrences_joined.gpkg")

To keep this post reproducible without anyone repeating the clicks, here is the same point-in-polygon assignment as the single sf call it really is. This is what the figures below use, and it returns the same labels QGIS produced:

occ_sf <- st_as_sf(occ, coords = c("lon", "lat"), crs = 4326)
occ_tagged <- st_join(occ_sf, patches["habitat"], join = st_intersects)
occ_tagged$habitat[is.na(occ_tagged$habitat)] <- "unassigned"

table(occ_tagged$habitat)


    arable     forest  grassland      scrub unassigned    wetland 
        11         19         27         15         25         13

The 25 points outside every patch now carry the label unassigned instead of a missing value, which keeps them visible in summaries. Here is the joined result drawn in R, projected to Stereo70:

library(ggplot2)
hab_levels <- c("forest", "grassland", "wetland", "scrub", "arable", "unassigned")
hab_cols <- c(forest = "#2f6b3c", grassland = "#8aa84b", wetland = "#2f6f93",
              scrub = "#b06d35", arable = "#caa12e", unassigned = "#b8b3a7")

occ_3844     <- st_transform(occ_tagged, 3844)
patches_3844 <- st_transform(patches, 3844)
occ_3844$habitat     <- factor(occ_3844$habitat, levels = hab_levels)
patches_3844$habitat <- factor(patches_3844$habitat, levels = hab_levels)

ggplot() +
  geom_sf(data = patches_3844, aes(fill = habitat), colour = "grey35", linewidth = 0.3, alpha = 0.18) +
  geom_sf(data = occ_3844, aes(colour = habitat), size = 2) +
  scale_fill_manual(values = hab_cols, drop = FALSE, guide = "none") +
  scale_colour_manual(values = hab_cols, drop = FALSE, name = "habitat") +
  labs(x = NULL, y = NULL) +
  theme_minimal(base_size = 13) +
  theme(panel.grid = element_line(colour = "grey92"))

A map of occurrence points over five habitat polygons. Points are coloured by habitat; points outside any polygon are grey. — Figure 1: The joined points, coloured by the habitat patch each one falls in. Grey points lie outside every patch.

With a habitat column attached, the kind of summary you came for is now ordinary data work. Counting species richness and total abundance per habitat takes a single grouped summary:

library(dplyr)
per_habitat <- occ_tagged |>
  st_drop_geometry() |>
  filter(habitat != "unassigned") |>
  group_by(habitat) |>
  summarise(richness = n_distinct(species),
            total_abundance = sum(abundance),
            records = n(),
            .groups = "drop")

knitr::kable(per_habitat, caption = "Richness and abundance per habitat, after the join.")

Richness and abundance per habitat, after the join.
habitat	richness	total_abundance	records
arable	5	75	11
forest	8	108	19
grassland	7	179	27
scrub	7	106	15
wetland	7	96	13

library(tidyr)
per_habitat |>
  pivot_longer(c(richness, total_abundance), names_to = "metric", values_to = "value") |>
  mutate(metric = factor(metric, c("richness", "total_abundance"),
                         c("species richness", "total abundance")),
         habitat = factor(habitat, hab_levels)) |>
  ggplot(aes(habitat, value, fill = habitat)) +
  geom_col(width = 0.7, colour = "grey30", linewidth = 0.25) +
  facet_wrap(~ metric, scales = "free_y") +
  scale_fill_manual(values = hab_cols, drop = FALSE, guide = "none") +
  labs(x = NULL, y = NULL) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.x = element_blank(),
        axis.text.x = element_text(angle = 30, hjust = 1),
        strip.text = element_text(face = "bold"))

Two bar panels. The left panel shows species richness per habitat; the right panel shows total abundance per habitat. — Figure 2: Species richness and total abundance per habitat for the matched points.

Which tool, when

A spatial join is a fair test of where the line sits. In sf it is one call, it is scripted, and it drops straight into a pipeline, so for anything you want to rerun or version it belongs in R. QGIS earns its place just before that: it is where you confirm the points landed where they should, where you catch the ones that fell in the sea because a CRS was set wrong, and where you reach for an operation by browsing the toolbox instead of recalling its name. The GeoPackage is what keeps the two honest with each other, a single standardised file passed from the place you looked at the data to the place you analyse it.

References

Pebesma, E. (2018). Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal, 10(1), 439-446. https://doi.org/10.32614/RJ-2018-009

--- title: "From QGIS to R: a spatial join and a clean GeoPackage handoff" description: "Load occurrence points in QGIS, tag each one with the habitat patch it falls in using a spatial join, export to GeoPackage, and pick the analysis up in R with sf." date: "2026-06-23 09:00" categories: [R, QGIS, sf, spatial, GeoPackage] image: thumbnail.png image-alt: "Synthetic occurrence points coloured by habitat patch, with points that fall outside every patch shown in grey." --- A lot of ecological analysis starts in a GUI and finishes in code. You load layers in QGIS, look at them, fix the obvious problems, and then move to R for the statistics. This post walks one small and very common version of that handoff. You have species occurrence points and a set of habitat polygons, and you want every point labelled with the patch it sits inside. QGIS does the looking and the point-in-polygon assignment; R reads the result and summarises it. The honest framing first: the assignment itself is one line of `sf`. So this is not a case where QGIS does something R cannot. What QGIS gives you here is a place to *see* the data before you trust it, and a searchable catalogue of spatial operations you do not have to remember by name. The post shows both routes and says where each one earns its place. The screenshots are from QGIS 3.44 LTR. The menu paths are the same in QGIS 4.0, so nothing below goes stale if you are on the newer release. ## The sample data Everything here runs on synthetic data so you can reproduce it without any field files. The chunk below builds 110 occurrence points scattered around a single coordinate, plus five habitat polygons. Some points fall outside every polygon on purpose, because a real join almost always leaves a few records unmatched, and handling those is part of the job. It also writes the two files you will open in QGIS: `occurrences.csv` and `habitat_patches.gpkg`. Note that the CSV has no `habitat` column. That column is exactly what the join is about to produce. ```{r} #| label: make-data #| message: false #| warning: false library(sf) set.seed(46772359) # five habitat polygons (lon/lat, EPSG:4326) mk <- function(m) st_polygon(list(rbind(m, m[1, , drop = FALSE]))) patches <- st_sf( patch_id = c("P1", "P2", "P3", "P4", "P5"), habitat = c("forest", "grassland", "wetland", "scrub", "arable"), geometry = st_sfc( mk(matrix(c(23.520,46.740, 23.570,46.740, 23.570,46.780, 23.520,46.780), ncol = 2, byrow = TRUE)), mk(matrix(c(23.580,46.750, 23.630,46.750, 23.630,46.790, 23.580,46.790), ncol = 2, byrow = TRUE)), mk(matrix(c(23.550,46.790, 23.600,46.790, 23.600,46.815, 23.550,46.815), ncol = 2, byrow = TRUE)), mk(matrix(c(23.635,46.745, 23.685,46.745, 23.685,46.785, 23.635,46.785), ncol = 2, byrow = TRUE)), mk(matrix(c(23.515,46.785, 23.550,46.785, 23.550,46.815, 23.515,46.815), ncol = 2, byrow = TRUE)), crs = 4326)) # occurrence points: oversample, then keep a controlled mix of inside and outside cand <- st_as_sf( data.frame(lon = runif(2000, 23.50, 23.70), lat = runif(2000, 46.72, 46.82)), coords = c("lon", "lat"), crs = 4326, remove = FALSE) membership <- st_join(cand, patches["habitat"], join = st_intersects)$habitat inside <- which(!is.na(membership)) outside <- which(is.na(membership)) keep <- sample(c(sample(inside, 85), sample(outside, 25))) sel <- cand[keep, ] xy <- st_coordinates(sel) # species, with a mild habitat preference so there is a signal to find later inside_hab <- membership[keep] fav <- list(forest = c("sp1","sp2"), grassland = c("sp3","sp4"), wetland = c("sp5","sp6"), scrub = c("sp7","sp8"), arable = c("sp4","sp8")) spp <- paste0("sp", 1:8) species <- vapply(seq_along(inside_hab), function(i) { if (is.na(inside_hab[i])) return(sample(spp, 1)) w <- setNames(rep(1, 8), spp); w[fav[[inside_hab[i]]]] <- 4 sample(spp, 1, prob = w) }, character(1)) occ <- data.frame( site_id = sprintf("S%03d", seq_along(keep)), lon = round(xy[, 1], 6), lat = round(xy[, 2], 6), species = species, abundance = rpois(length(keep), 6) + 1L ) write.csv(occ, "occurrences.csv", row.names = FALSE) st_write(patches, "habitat_patches.gpkg", delete_dsn = TRUE, quiet = TRUE) knitr::kable(head(occ), caption = "First rows of the synthetic occurrence table.") ``` ## In QGIS **1. Load the points.** `Layer` then `Add Layer` then `Add Delimited Text Layer`. Pick `occurrences.csv`, set the format to CSV, choose point geometry with `X field = lon` and `Y field = lat`, and set the geometry CRS to `EPSG:4326 - WGS 84`. The dialog shows you a preview of the table before you commit, which is a good moment to confirm the coordinate columns were read as numbers and not text. ![The Add Delimited Text Layer dialog, with lon and lat mapped to the X and Y fields and the geometry CRS set to EPSG:4326.](qgis-delimited-text.png) **2. Add the polygons and look.** Drag `habitat_patches.gpkg` onto the canvas, or load it through `Add Vector Layer`. Put the point layer above the polygons in the Layers panel. Now you can see the thing you are about to compute: most points land inside a patch, and a handful sit outside every one of them. That second group is the reason the join needs a deliberate decision later. ![The QGIS canvas showing occurrence points over five habitat polygons. Several points fall outside any polygon.](qgis-canvas.png) **3. Move to a projected CRS.** The points came in as longitude and latitude, which is fine for the join but awkward for anything measured in metres. Open `Project` then `Properties` then `CRS` (or click the CRS button at the bottom right of the window), filter for `3844`, and choose `EPSG:3844 - Pulkovo 1942(58) / Stereo70`, the national projected system for Romania. The map reprojects on the fly. A spatial join by intersection gives the same answer in either CRS, but working in metres keeps distances and areas honest once you start measuring things. ![The CRS selector filtered to 3844, with EPSG:3844 Stereo70 selected and its area of use shown.](qgis-crs.png) **4. Run the spatial join.** Open `Processing` then `Toolbox`, and search for `Join attributes by location`. Set the base layer to `occurrences` and the join layer to `habitat_patches`. Use the `intersects` predicate, and a join type of *take attributes of the first matching feature only*, since each point sits in at most one patch. Leave `Discard records which could not be joined` unchecked. That box is the whole decision: unchecked, the 25 outside points stay in the output with an empty `habitat`, which is what you want so you can account for them rather than lose them silently. ![The Join attributes by location dialog, set to an intersects predicate, one-to-one join, with the discard option left unchecked.](qgis-join-dialog.png) **5. Export to GeoPackage.** Right-click the joined layer, then `Export` then `Save Features As`. Choose GeoPackage as the format, name the file `occurrences_joined.gpkg`, and set the output CRS to `EPSG:3844`. GeoPackage is the clean way to hand vector data to R: one file, an open OGC standard, and none of the four-file shapefile baggage. That single file is the entire handoff. ## Back in R QGIS wrote `occurrences_joined.gpkg`, and you can read it straight back: ```{r} #| label: read-gpkg #| eval: false library(sf) occ_tagged <- st_read("occurrences_joined.gpkg") ``` To keep this post reproducible without anyone repeating the clicks, here is the same point-in-polygon assignment as the single `sf` call it really is. This is what the figures below use, and it returns the same labels QGIS produced: ```{r} #| label: join occ_sf <- st_as_sf(occ, coords = c("lon", "lat"), crs = 4326) occ_tagged <- st_join(occ_sf, patches["habitat"], join = st_intersects) occ_tagged$habitat[is.na(occ_tagged$habitat)] <- "unassigned" table(occ_tagged$habitat) ``` The 25 points outside every patch now carry the label `unassigned` instead of a missing value, which keeps them visible in summaries. Here is the joined result drawn in R, projected to Stereo70: ```{r} #| label: fig-map #| message: false #| warning: false #| fig-width: 8.2 #| fig-height: 6 #| fig-cap: "The joined points, coloured by the habitat patch each one falls in. Grey points lie outside every patch." #| fig-alt: "A map of occurrence points over five habitat polygons. Points are coloured by habitat; points outside any polygon are grey." library(ggplot2) hab_levels <- c("forest", "grassland", "wetland", "scrub", "arable", "unassigned") hab_cols <- c(forest = "#2f6b3c", grassland = "#8aa84b", wetland = "#2f6f93", scrub = "#b06d35", arable = "#caa12e", unassigned = "#b8b3a7") occ_3844 <- st_transform(occ_tagged, 3844) patches_3844 <- st_transform(patches, 3844) occ_3844$habitat <- factor(occ_3844$habitat, levels = hab_levels) patches_3844$habitat <- factor(patches_3844$habitat, levels = hab_levels) ggplot() + geom_sf(data = patches_3844, aes(fill = habitat), colour = "grey35", linewidth = 0.3, alpha = 0.18) + geom_sf(data = occ_3844, aes(colour = habitat), size = 2) + scale_fill_manual(values = hab_cols, drop = FALSE, guide = "none") + scale_colour_manual(values = hab_cols, drop = FALSE, name = "habitat") + labs(x = NULL, y = NULL) + theme_minimal(base_size = 13) + theme(panel.grid = element_line(colour = "grey92")) ``` With a `habitat` column attached, the kind of summary you came for is now ordinary data work. Counting species richness and total abundance per habitat takes a single grouped summary: ```{r} #| label: summary #| message: false #| warning: false library(dplyr) per_habitat <- occ_tagged |> st_drop_geometry() |> filter(habitat != "unassigned") |> group_by(habitat) |> summarise(richness = n_distinct(species), total_abundance = sum(abundance), records = n(), .groups = "drop") knitr::kable(per_habitat, caption = "Richness and abundance per habitat, after the join.") ``` ```{r} #| label: fig-summary #| message: false #| warning: false #| fig-width: 8.2 #| fig-height: 3.9 #| fig-cap: "Species richness and total abundance per habitat for the matched points." #| fig-alt: "Two bar panels. The left panel shows species richness per habitat; the right panel shows total abundance per habitat." library(tidyr) per_habitat |> pivot_longer(c(richness, total_abundance), names_to = "metric", values_to = "value") |> mutate(metric = factor(metric, c("richness", "total_abundance"), c("species richness", "total abundance")), habitat = factor(habitat, hab_levels)) |> ggplot(aes(habitat, value, fill = habitat)) + geom_col(width = 0.7, colour = "grey30", linewidth = 0.25) + facet_wrap(~ metric, scales = "free_y") + scale_fill_manual(values = hab_cols, drop = FALSE, guide = "none") + labs(x = NULL, y = NULL) + theme_minimal(base_size = 13) + theme(panel.grid.major.x = element_blank(), axis.text.x = element_text(angle = 30, hjust = 1), strip.text = element_text(face = "bold")) ``` ## Which tool, when A spatial join is a fair test of where the line sits. In `sf` it is one call, it is scripted, and it drops straight into a pipeline, so for anything you want to rerun or version it belongs in R. QGIS earns its place just before that: it is where you confirm the points landed where they should, where you catch the ones that fell in the sea because a CRS was set wrong, and where you reach for an operation by browsing the toolbox instead of recalling its name. The GeoPackage is what keeps the two honest with each other, a single standardised file passed from the place you looked at the data to the place you analyse it. ## References Pebesma, E. (2018). Simple Features for R: Standardized Support for Spatial Vector Data. *The R Journal*, 10(1), 439-446. <https://doi.org/10.32614/RJ-2018-009>