Cleaning GBIF and iNaturalist records in R

data cleaning

GBIF

ecology tutorial

Occurrence records from GBIF and iNaturalist need cleaning. In R: drop impossible and duplicate coordinates, flag low precision, and map the result with sf.

Author

Tidy Ecology

Published

2026-07-02

Every richness map and every species distribution model starts from the same raw material: a table of where a species has been recorded. The large aggregators, GBIF and iNaturalist, put millions of those records within reach of a single function call. They also hand you records sitting at zero degrees latitude and longitude, records pinned to a country centroid, museum specimens with the collector’s home town as the coordinate, and the same observation entered three times. None of that is unusual, and none of it is the aggregator’s fault: the data is a merge of thousands of sources. The work is yours, before any analysis is trustworthy. This post walks the cleaning that turns a raw pull into a defensible set of points.

Getting the records

The rgbif package is the R client for the GBIF API. For a single species you look up its backbone key, then request georeferenced records. iNaturalist research-grade observations flow into GBIF as well, so the same pull reaches them; for iNaturalist directly, spocc wraps its API. The call below is shown but not run, because a live pull is a moving target: the database changes daily and the API rate-limits, so nothing here would reproduce.

library(rgbif)

key <- name_backbone("Pulsatilla patens")$usageKey
raw <- occ_search(taxonKey = key, hasCoordinate = TRUE, limit = 2000)$data

# iNaturalist records directly, through the spocc package:
# spocc::occ(query = "Pulsatilla patens", from = "inat", limit = 2000)

For a large or repeatable download use occ_download() rather than occ_search(): it queues a full extract server-side and returns a citable DOI, which the GBIF data-use terms ask you to keep. What follows works from a small saved extract bundled with this post, with the same columns occ_search() returns. The coordinates are illustrative, not a real locality.

library(dplyr)
library(sf)
library(ggplot2)

paper <- "#f5f4ee"; ink <- "#16241d"; forest <- "#275139"
gold  <- "#cda23f"; abandoned <- "#b5534e"; faint <- "#5d6b61"; line <- "#dad9ca"

theme_te <- theme_minimal(base_size = 12) +
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_line(colour = line, linewidth = 0.3),
        plot.background = element_rect(fill = paper, colour = NA),
        panel.background = element_rect(fill = paper, colour = NA),
        axis.title = element_text(colour = ink),
        axis.text = element_text(colour = faint),
        plot.title = element_text(colour = ink, face = "bold"))

records <- read.csv("occurrences_raw.csv", stringsAsFactors = FALSE,
                    na.strings = "")
nrow(records)

[1] 43

Forty-three records, with the fields GBIF returns: a species name, decimal latitude and longitude, a coordinate uncertainty in metres, the basis of record, a year, an occurrence status, and a country code.

head(records[, c("decimalLatitude", "decimalLongitude",
                 "coordinateUncertaintyInMeters", "basisOfRecord",
                 "occurrenceStatus")], 4)

  decimalLatitude decimalLongitude coordinateUncertaintyInMeters
1        47.09602         26.46606                          1000
2        46.66959         22.05218                          1000
3        45.42042         25.38712                           250
4        45.85717         21.09694                          1000
       basisOfRecord occurrenceStatus
1 PRESERVED_SPECIMEN          PRESENT
2  HUMAN_OBSERVATION          PRESENT
3  HUMAN_OBSERVATION             <NA>
4        OBSERVATION             <NA>

The problems hiding in this table are the common ones: rows with no coordinate at all, a row at exactly zero latitude and longitude (the “null island” off West Africa where missing values often land), an impossible latitude, coordinates so imprecise they could be anywhere in a county, exact duplicates, an absence dressed as a record, and a fossil that has no place in a present-day distribution. Each cleaning step below targets one of these, and reports how many rows are left.

Coordinates that could be real

The first pass keeps only rows whose coordinates exist and fall inside the possible range. Latitude runs from -90 to 90 and longitude from -180 to 180; anything outside is a data-entry error. The zero-zero point is dropped explicitly, because it is almost always a missing value that was silently coerced rather than a genuine record in the Gulf of Guinea.

clean <- records |>
  filter(!is.na(decimalLatitude), !is.na(decimalLongitude)) |>
  filter(decimalLatitude != 0 | decimalLongitude != 0) |>
  filter(between(decimalLatitude, -90, 90),
         between(decimalLongitude, -180, 180))
nrow(clean)

[1] 39

That removes four rows: the two with a missing coordinate, the null-island point, and the impossible latitude. Thirty-nine remain.

Presence, and living records

An occurrence table can carry absences and fossils, and neither belongs in a map of where a species currently grows. GBIF marks absences with occurrenceStatus, and records the kind of evidence in basisOfRecord. Here we keep presences (or rows that leave the field blank) and observation or specimen records, which drops the fossil.

clean <- clean |>
  filter(is.na(occurrenceStatus) | occurrenceStatus == "PRESENT") |>
  filter(basisOfRecord %in% c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN",
                              "MACHINE_OBSERVATION", "OBSERVATION"))
nrow(clean)

[1] 37

Two more gone, the flagged absence and the fossil specimen, leaving thirty-seven.

Precision you can use

A coordinate carries a stated uncertainty. A record good to 100 metres and one good to 50 kilometres are not the same evidence, and mixing them quietly degrades any analysis at a fine grain. The right ceiling depends on the question; for a two-kilometre richness grid, ten kilometres is already generous. Rows with no stated uncertainty are kept here rather than discarded, since a blank is common and not the same as a bad value, but that is a judgement worth stating out loud.

clean <- clean |>
  filter(is.na(coordinateUncertaintyInMeters) |
         coordinateUncertaintyInMeters <= 10000)
nrow(clean)

[1] 35

The two coordinates uncertain to fifty kilometres drop out. Thirty-five left.

The same record, twice

Merged databases duplicate. The same observation can arrive from the original recorder and again through an aggregating dataset, identical in species and position. Collapsing on the fields that define a duplicate removes the copies while keeping the first of each.

clean <- clean |>
  distinct(species, decimalLatitude, decimalLongitude, .keep_all = TRUE)
nrow(clean)

[1] 32

Three duplicate rows go, and thirty-two remain.

Inside the study region

The last step is spatial, so the records become an sf object. Turning the two coordinate columns into geometry with st_as_sf() lets the points be tested against a polygon: here a bounding box around the region of interest, which discards the two stray points that sit far to the west, one of them out in the Atlantic.

pts <- st_as_sf(clean, coords = c("decimalLongitude", "decimalLatitude"),
                crs = 4326, remove = FALSE)

region <- st_as_sfc(st_bbox(c(xmin = 20, ymin = 44, xmax = 28, ymax = 49),
                            crs = st_crs(4326)))
inside <- st_within(pts, region, sparse = FALSE)[, 1]
pts    <- pts[inside, ]
nrow(pts)

[1] 30

Thirty points survive from the forty-three we started with. Thirteen removals, none of them arbitrary: each answered a specific, statable objection to a record.

ggplot() +
  geom_sf(data = region, fill = NA, colour = faint, linewidth = 0.4) +
  geom_sf(data = pts, colour = forest, size = 2.4, alpha = 0.9) +
  labs(x = NULL, y = NULL) +
  theme_te

A map showing the rectangular study-region boundary over central Europe with thirty green points scattered across its interior, none outside the box. — Figure 1: The thirty records that survived cleaning, inside the study region.

Doing this at scale

Cleaning by hand is worth doing once, to see what each rule removes. For many species or millions of records, the CoordinateCleaner package packages the same logic and adds tests that are awkward to write yourself: flagging points on country and province centroids, points at the GBIF headquarters or at biodiversity institutions, and points in the sea, all against built-in gazetteers.

library(CoordinateCleaner)

flags <- clean_coordinates(
  records,
  lon = "decimalLongitude", lat = "decimalLatitude", species = "species",
  tests = c("zeros", "equal", "duplicates", "seas", "centroids", "institutions"))

clean <- records[flags$.summary, ]   # keep records that passed every test

The manual pass and the package agree on the principle: a cleaning decision is only defensible if you can say which problem it fixes.

Where to go next

The cleaned points are the input the rest of the spatial workflow assumes. Feeding them to a grid gives a species richness map; a continuous environmental surface calls for raster work with terra; and either feeds a species distribution model. The other half of cleaning is taxonomic: reconciling synonyms and misspelt names so that records for one species are actually counted together, which rgbif’s name_backbone() and the taxize package handle.

References

Chamberlain SA, Boettiger C 2017 PeerJ Preprints 5:e3304v1 (10.7287/peerj.preprints.3304v1)

Zizka A, Silvestro D, Andermann T, Azevedo J, Duarte Ritter C, Edler D, Farooq H, Herdean A, Ariza M, Scharn R, Svantesson S, Wengstrom N, Zizka V, Antonelli A 2019 Methods in Ecology and Evolution 10(5):744-751 (10.1111/2041-210X.13152)

Pebesma E 2018 The R Journal 10(1):439-446 (10.32614/RJ-2018-009)

--- title: "Cleaning GBIF and iNaturalist records in R" description: "Occurrence records from GBIF and iNaturalist need cleaning. In R: drop impossible and duplicate coordinates, flag low precision, and map the result with sf." date: "2026-07-02 10:00" categories: [R, sf, data cleaning, GBIF, ecology tutorial] image: thumbnail.png image-alt: "A study-region rectangle over central Europe with thirty green points scattered inside it, the occurrence records that survived cleaning." --- Every richness map and every species distribution model starts from the same raw material: a table of where a species has been recorded. The large aggregators, GBIF and iNaturalist, put millions of those records within reach of a single function call. They also hand you records sitting at zero degrees latitude and longitude, records pinned to a country centroid, museum specimens with the collector's home town as the coordinate, and the same observation entered three times. None of that is unusual, and none of it is the aggregator's fault: the data is a merge of thousands of sources. The work is yours, before any analysis is trustworthy. This post walks the cleaning that turns a raw pull into a defensible set of points. ## Getting the records The `rgbif` package is the R client for the GBIF API. For a single species you look up its backbone key, then request georeferenced records. iNaturalist research-grade observations flow into GBIF as well, so the same pull reaches them; for iNaturalist directly, `spocc` wraps its API. The call below is shown but not run, because a live pull is a moving target: the database changes daily and the API rate-limits, so nothing here would reproduce. ```{r} #| eval: false library(rgbif) key <- name_backbone("Pulsatilla patens")$usageKey raw <- occ_search(taxonKey = key, hasCoordinate = TRUE, limit = 2000)$data # iNaturalist records directly, through the spocc package: # spocc::occ(query = "Pulsatilla patens", from = "inat", limit = 2000) ``` For a large or repeatable download use `occ_download()` rather than `occ_search()`: it queues a full extract server-side and returns a citable DOI, which the GBIF data-use terms ask you to keep. What follows works from a small saved extract bundled with this post, with the same columns `occ_search()` returns. The coordinates are illustrative, not a real locality. ```{r setup} #| message: false #| warning: false library(dplyr) library(sf) library(ggplot2) paper <- "#f5f4ee"; ink <- "#16241d"; forest <- "#275139" gold <- "#cda23f"; abandoned <- "#b5534e"; faint <- "#5d6b61"; line <- "#dad9ca" theme_te <- theme_minimal(base_size = 12) + theme(panel.grid.minor = element_blank(), panel.grid.major = element_line(colour = line, linewidth = 0.3), plot.background = element_rect(fill = paper, colour = NA), panel.background = element_rect(fill = paper, colour = NA), axis.title = element_text(colour = ink), axis.text = element_text(colour = faint), plot.title = element_text(colour = ink, face = "bold")) ``` ```{r read} records <- read.csv("occurrences_raw.csv", stringsAsFactors = FALSE, na.strings = "") nrow(records) ``` Forty-three records, with the fields GBIF returns: a species name, decimal latitude and longitude, a coordinate uncertainty in metres, the basis of record, a year, an occurrence status, and a country code. ```{r peek} head(records[, c("decimalLatitude", "decimalLongitude", "coordinateUncertaintyInMeters", "basisOfRecord", "occurrenceStatus")], 4) ``` The problems hiding in this table are the common ones: rows with no coordinate at all, a row at exactly zero latitude and longitude (the "null island" off West Africa where missing values often land), an impossible latitude, coordinates so imprecise they could be anywhere in a county, exact duplicates, an absence dressed as a record, and a fossil that has no place in a present-day distribution. Each cleaning step below targets one of these, and reports how many rows are left. ## Coordinates that could be real The first pass keeps only rows whose coordinates exist and fall inside the possible range. Latitude runs from -90 to 90 and longitude from -180 to 180; anything outside is a data-entry error. The zero-zero point is dropped explicitly, because it is almost always a missing value that was silently coerced rather than a genuine record in the Gulf of Guinea. ```{r step-coords} clean <- records |> filter(!is.na(decimalLatitude), !is.na(decimalLongitude)) |> filter(decimalLatitude != 0 | decimalLongitude != 0) |> filter(between(decimalLatitude, -90, 90), between(decimalLongitude, -180, 180)) nrow(clean) ``` That removes four rows: the two with a missing coordinate, the null-island point, and the impossible latitude. Thirty-nine remain. ## Presence, and living records An occurrence table can carry absences and fossils, and neither belongs in a map of where a species currently grows. GBIF marks absences with `occurrenceStatus`, and records the kind of evidence in `basisOfRecord`. Here we keep presences (or rows that leave the field blank) and observation or specimen records, which drops the fossil. ```{r step-basis} clean <- clean |> filter(is.na(occurrenceStatus) | occurrenceStatus == "PRESENT") |> filter(basisOfRecord %in% c("HUMAN_OBSERVATION", "PRESERVED_SPECIMEN", "MACHINE_OBSERVATION", "OBSERVATION")) nrow(clean) ``` Two more gone, the flagged absence and the fossil specimen, leaving thirty-seven. ## Precision you can use A coordinate carries a stated uncertainty. A record good to 100 metres and one good to 50 kilometres are not the same evidence, and mixing them quietly degrades any analysis at a fine grain. The right ceiling depends on the question; for a two-kilometre richness grid, ten kilometres is already generous. Rows with no stated uncertainty are kept here rather than discarded, since a blank is common and not the same as a bad value, but that is a judgement worth stating out loud. ```{r step-precision} clean <- clean |> filter(is.na(coordinateUncertaintyInMeters) | coordinateUncertaintyInMeters <= 10000) nrow(clean) ``` The two coordinates uncertain to fifty kilometres drop out. Thirty-five left. ## The same record, twice Merged databases duplicate. The same observation can arrive from the original recorder and again through an aggregating dataset, identical in species and position. Collapsing on the fields that define a duplicate removes the copies while keeping the first of each. ```{r step-dupes} clean <- clean |> distinct(species, decimalLatitude, decimalLongitude, .keep_all = TRUE) nrow(clean) ``` Three duplicate rows go, and thirty-two remain. ## Inside the study region The last step is spatial, so the records become an `sf` object. Turning the two coordinate columns into geometry with `st_as_sf()` lets the points be tested against a polygon: here a bounding box around the region of interest, which discards the two stray points that sit far to the west, one of them out in the Atlantic. ```{r step-region} pts <- st_as_sf(clean, coords = c("decimalLongitude", "decimalLatitude"), crs = 4326, remove = FALSE) region <- st_as_sfc(st_bbox(c(xmin = 20, ymin = 44, xmax = 28, ymax = 49), crs = st_crs(4326))) inside <- st_within(pts, region, sparse = FALSE)[, 1] pts <- pts[inside, ] nrow(pts) ``` Thirty points survive from the forty-three we started with. Thirteen removals, none of them arbitrary: each answered a specific, statable objection to a record. ```{r} #| label: fig-clean #| fig-cap: "The thirty records that survived cleaning, inside the study region." #| fig-alt: "A map showing the rectangular study-region boundary over central Europe with thirty green points scattered across its interior, none outside the box." #| fig-width: 7 #| fig-height: 4.6 ggplot() + geom_sf(data = region, fill = NA, colour = faint, linewidth = 0.4) + geom_sf(data = pts, colour = forest, size = 2.4, alpha = 0.9) + labs(x = NULL, y = NULL) + theme_te ``` ## Doing this at scale Cleaning by hand is worth doing once, to see what each rule removes. For many species or millions of records, the `CoordinateCleaner` package packages the same logic and adds tests that are awkward to write yourself: flagging points on country and province centroids, points at the GBIF headquarters or at biodiversity institutions, and points in the sea, all against built-in gazetteers. ```{r} #| eval: false library(CoordinateCleaner) flags <- clean_coordinates( records, lon = "decimalLongitude", lat = "decimalLatitude", species = "species", tests = c("zeros", "equal", "duplicates", "seas", "centroids", "institutions")) clean <- records[flags$.summary, ] # keep records that passed every test ``` The manual pass and the package agree on the principle: a cleaning decision is only defensible if you can say which problem it fixes. ## Where to go next The cleaned points are the input the rest of the spatial workflow assumes. Feeding them to a grid gives a [species richness map](../richness-mapping-sf/); a continuous environmental surface calls for [raster work with terra](../terra-raster-basics/); and either feeds a species distribution model. The other half of cleaning is taxonomic: reconciling synonyms and misspelt names so that records for one species are actually counted together, which `rgbif`'s `name_backbone()` and the `taxize` package handle. ## References Chamberlain SA, Boettiger C 2017 PeerJ Preprints 5:e3304v1 (10.7287/peerj.preprints.3304v1) Zizka A, Silvestro D, Andermann T, Azevedo J, Duarte Ritter C, Edler D, Farooq H, Herdean A, Ariza M, Scharn R, Svantesson S, Wengstrom N, Zizka V, Antonelli A 2019 Methods in Ecology and Evolution 10(5):744-751 (10.1111/2041-210X.13152) Pebesma E 2018 The R Journal 10(1):439-446 (10.32614/RJ-2018-009) ## Related tutorials - [Mapping species richness with sf](../richness-mapping-sf/) - [Raster basics with terra](../terra-raster-basics/) - [QGIS to R spatial join](../qgis-to-r-spatial-join/) - [Computing diversity indices](../diversity-indices-in-r/)