library(rgbif)
key <- name_backbone("Pulsatilla patens")$usageKey
raw <- occ_search(taxonKey = key, hasCoordinate = TRUE, limit = 2000)$data
# iNaturalist records directly, through the spocc package:
# spocc::occ(query = "Pulsatilla patens", from = "inat", limit = 2000)Cleaning GBIF and iNaturalist records in R
Every richness map and every species distribution model starts from the same raw material: a table of where a species has been recorded. The large aggregators, GBIF and iNaturalist, put millions of those records within reach of a single function call. They also hand you records sitting at zero degrees latitude and longitude, records pinned to a country centroid, museum specimens with the collector’s home town as the coordinate, and the same observation entered three times. None of that is unusual, and none of it is the aggregator’s fault: the data is a merge of thousands of sources. The work is yours, before any analysis is trustworthy. This post walks the cleaning that turns a raw pull into a defensible set of points.
Getting the records
The rgbif package is the R client for the GBIF API. For a single species you look up its backbone key, then request georeferenced records. iNaturalist research-grade observations flow into GBIF as well, so the same pull reaches them; for iNaturalist directly, spocc wraps its API. The call below is shown but not run, because a live pull is a moving target: the database changes daily and the API rate-limits, so nothing here would reproduce.
For a large or repeatable download use occ_download() rather than occ_search(): it queues a full extract server-side and returns a citable DOI, which the GBIF data-use terms ask you to keep. What follows works from a small saved extract bundled with this post, with the same columns occ_search() returns. The coordinates are illustrative, not a real locality.
library(dplyr)
library(sf)
library(ggplot2)
paper <- "#f5f4ee"; ink <- "#16241d"; forest <- "#275139"
gold <- "#cda23f"; abandoned <- "#b5534e"; faint <- "#5d6b61"; line <- "#dad9ca"
theme_te <- theme_minimal(base_size = 12) +
theme(panel.grid.minor = element_blank(),
panel.grid.major = element_line(colour = line, linewidth = 0.3),
plot.background = element_rect(fill = paper, colour = NA),
panel.background = element_rect(fill = paper, colour = NA),
axis.title = element_text(colour = ink),
axis.text = element_text(colour = faint),
plot.title = element_text(colour = ink, face = "bold"))records <- read.csv("occurrences_raw.csv", stringsAsFactors = FALSE,
na.strings = "")
nrow(records)[1] 43
Forty-three records, with the fields GBIF returns: a species name, decimal latitude and longitude, a coordinate uncertainty in metres, the basis of record, a year, an occurrence status, and a country code.
head(records[, c("decimalLatitude", "decimalLongitude",
"coordinateUncertaintyInMeters", "basisOfRecord",
"occurrenceStatus")], 4) decimalLatitude decimalLongitude coordinateUncertaintyInMeters
1 47.09602 26.46606 1000
2 46.66959 22.05218 1000
3 45.42042 25.38712 250
4 45.85717 21.09694 1000
basisOfRecord occurrenceStatus
1 PRESERVED_SPECIMEN PRESENT
2 HUMAN_OBSERVATION PRESENT
3 HUMAN_OBSERVATION <NA>
4 OBSERVATION <NA>
The problems hiding in this table are the common ones: rows with no coordinate at all, a row at exactly zero latitude and longitude (the “null island” off West Africa where missing values often land), an impossible latitude, coordinates so imprecise they could be anywhere in a county, exact duplicates, an absence dressed as a record, and a fossil that has no place in a present-day distribution. Each cleaning step below targets one of these, and reports how many rows are left.
Coordinates that could be real
The first pass keeps only rows whose coordinates exist and fall inside the possible range. Latitude runs from -90 to 90 and longitude from -180 to 180; anything outside is a data-entry error. The zero-zero point is dropped explicitly, because it is almost always a missing value that was silently coerced rather than a genuine record in the Gulf of Guinea.
clean <- records |>
filter(!is.na(decimalLatitude), !is.na(decimalLongitude)) |>
filter(decimalLatitude != 0 | decimalLongitude != 0) |>
filter(between(decimalLatitude, -90, 90),
between(decimalLongitude, -180, 180))
nrow(clean)[1] 39
That removes four rows: the two with a missing coordinate, the null-island point, and the impossible latitude. Thirty-nine remain.
Presence, and living records
An occurrence table can carry absences and fossils, and neither belongs in a map of where a species currently grows. GBIF marks absences with occurrenceStatus, and records the kind of evidence in basisOfRecord. Here we keep presences (or rows that leave the field blank) and observation or specimen records, which drops the fossil.
clean <- clean |>
filter(is.na(occurrenceStatus) | occurrenceStatus == "PRESENT") |>
filter(basisOfRecord %in% c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN",
"MACHINE_OBSERVATION", "OBSERVATION"))
nrow(clean)[1] 37
Two more gone, the flagged absence and the fossil specimen, leaving thirty-seven.
Precision you can use
A coordinate carries a stated uncertainty. A record good to 100 metres and one good to 50 kilometres are not the same evidence, and mixing them quietly degrades any analysis at a fine grain. The right ceiling depends on the question; for a two-kilometre richness grid, ten kilometres is already generous. Rows with no stated uncertainty are kept here rather than discarded, since a blank is common and not the same as a bad value, but that is a judgement worth stating out loud.
clean <- clean |>
filter(is.na(coordinateUncertaintyInMeters) |
coordinateUncertaintyInMeters <= 10000)
nrow(clean)[1] 35
The two coordinates uncertain to fifty kilometres drop out. Thirty-five left.
The same record, twice
Merged databases duplicate. The same observation can arrive from the original recorder and again through an aggregating dataset, identical in species and position. Collapsing on the fields that define a duplicate removes the copies while keeping the first of each.
clean <- clean |>
distinct(species, decimalLatitude, decimalLongitude, .keep_all = TRUE)
nrow(clean)[1] 32
Three duplicate rows go, and thirty-two remain.
Inside the study region
The last step is spatial, so the records become an sf object. Turning the two coordinate columns into geometry with st_as_sf() lets the points be tested against a polygon: here a bounding box around the region of interest, which discards the two stray points that sit far to the west, one of them out in the Atlantic.
pts <- st_as_sf(clean, coords = c("decimalLongitude", "decimalLatitude"),
crs = 4326, remove = FALSE)
region <- st_as_sfc(st_bbox(c(xmin = 20, ymin = 44, xmax = 28, ymax = 49),
crs = st_crs(4326)))
inside <- st_within(pts, region, sparse = FALSE)[, 1]
pts <- pts[inside, ]
nrow(pts)[1] 30
Thirty points survive from the forty-three we started with. Thirteen removals, none of them arbitrary: each answered a specific, statable objection to a record.
ggplot() +
geom_sf(data = region, fill = NA, colour = faint, linewidth = 0.4) +
geom_sf(data = pts, colour = forest, size = 2.4, alpha = 0.9) +
labs(x = NULL, y = NULL) +
theme_te
Doing this at scale
Cleaning by hand is worth doing once, to see what each rule removes. For many species or millions of records, the CoordinateCleaner package packages the same logic and adds tests that are awkward to write yourself: flagging points on country and province centroids, points at the GBIF headquarters or at biodiversity institutions, and points in the sea, all against built-in gazetteers.
library(CoordinateCleaner)
flags <- clean_coordinates(
records,
lon = "decimalLongitude", lat = "decimalLatitude", species = "species",
tests = c("zeros", "equal", "duplicates", "seas", "centroids", "institutions"))
clean <- records[flags$.summary, ] # keep records that passed every testThe manual pass and the package agree on the principle: a cleaning decision is only defensible if you can say which problem it fixes.
Where to go next
The cleaned points are the input the rest of the spatial workflow assumes. Feeding them to a grid gives a species richness map; a continuous environmental surface calls for raster work with terra; and either feeds a species distribution model. The other half of cleaning is taxonomic: reconciling synonyms and misspelt names so that records for one species are actually counted together, which rgbif’s name_backbone() and the taxize package handle.
References
Chamberlain SA, Boettiger C 2017 PeerJ Preprints 5:e3304v1 (10.7287/peerj.preprints.3304v1)
Zizka A, Silvestro D, Andermann T, Azevedo J, Duarte Ritter C, Edler D, Farooq H, Herdean A, Ariza M, Scharn R, Svantesson S, Wengstrom N, Zizka V, Antonelli A 2019 Methods in Ecology and Evolution 10(5):744-751 (10.1111/2041-210X.13152)
Pebesma E 2018 The R Journal 10(1):439-446 (10.32614/RJ-2018-009)