Handling Data with rgbif

Preamble, Package-Loading, and GBIF API Credential Registering:
## Custom install & load function
install.load.package <- function(x) {
  if (!require(x, character.only = TRUE)) {
    install.packages(x, repos = "http://cran.us.r-project.org")
  }
  require(x, character.only = TRUE)
}
## names of packages we want installed (if not installed yet) and loaded
package_vec <- c(
  "rgbif",
  "knitr", # for rmarkdown table visualisations
  "sp", # for spatialobject creation
  "sf", # an alternative spatial object library
  "ggplot2", # for visualistion
  "maptools", # for visualisation
  "rgeos", # for visualisation
  "raster", # for setting and reading CRS
  "rnaturalearth" # for shapefiles of naturalearth
)
## executing install & load for each package
sapply(package_vec, install.load.package)
##         rgbif         knitr            sp            sf       ggplot2      maptools 
##          TRUE          TRUE          TRUE          TRUE          TRUE          TRUE 
##         rgeos        raster rnaturalearth 
##          TRUE          TRUE          TRUE
options(gbif_user = "my gbif username")
options(gbif_email = "my registered gbif e-mail")
options(gbif_pwd = "my gbif password")

First, we obtain and load the data we are interested in as follows. Note that the download query references a taxon key stored in sp_key.
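This sp_key object would have been created earlier in the workshop; as a minimal sketch (using the species name that appears in the records later in this tutorial), such a key can be retrieved from the GBIF backbone like so:

# Sketch: match the species name against the GBIF backbone taxonomy
sp_key <- name_backbone(name = "Calluna vulgaris")$usageKey
sp_key # 2882482, cf. the taxonKey column in the data further below

With this key at hand, we can place and retrieve the download query: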

# Download query
res <- occ_download(
  pred("taxonKey", sp_key),
  pred_in("basisOfRecord", c("HUMAN_OBSERVATION")),
  pred("country", "NO"),
  pred("hasCoordinate", TRUE),
  pred_gte("year", 2000),
  pred_lte("year", 2022)
)
# Downloading and Loading
res_meta <- occ_download_wait(res, status_ping = 5, curlopts = list(), quiet = FALSE)
res_get <- occ_download_get(res)
res_data <- occ_download_import(res_get)
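Before moving on, it can be worth sanity-checking what has arrived; a small, optional sketch relying only on the download metadata and the imported table:

# Optional sanity checks (sketch)
occ_download_meta(res) # download status, total record count, and the citable DOI
dim(res_data) # rows = occurrence records, columns = Darwin Core fields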

With this data in our R environment, we are ready to explore it.

There is a myriad of things you can do to and with GBIF-mediated data; here, we focus on just a few data handling steps.

Initial Data Handling

Before working with the data you obtained via GBIF, it is usually good practice to first check that everything is as expected and in order, and then to reduce the dataset to meet your quality standards and extract the variables relevant to your application.

Common Data Considerations & Issues

Common data considerations and quality flags largely relate to geolocation (though other quality markers do exist). These can be used as limiting factors in data discovery, when querying downloads, and after a download is done and the data is loaded. Within the GBIF Portal, these options are presented in a side-bar.

In fact, we have already used this functionality to control for data quality markers when carrying out data discovery (occ_search(...)) and data download queries (occ_download(...)) by matching Darwin Core terms like basisOfRecord or hasCoordinate.

For this exercise, let’s focus on some quality markers contained in our already-downloaded data set which we may want to use to further limit the data for subsequent analyses. To do so, let’s consider the coordinateUncertaintyInMeters field by visualising the values we have obtained for each record in our occurrence data:

ggplot(res_data, aes(x = coordinateUncertaintyInMeters)) +
  geom_histogram(bins = 1e2) +
  theme_bw() +
  scale_y_continuous(trans = "log10")

Note: The y-axis on the above plot is log10-transformed, and 1053 of the underlying records do not report a value for coordinateUncertaintyInMeters and are therefore removed from the visualisation.
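To put numbers to this spread (and to the missing values just mentioned), a quick numeric summary may help; a minimal sketch:

# Sketch: numeric summary of the uncertainty field; the NA count matches the note above
summary(res_data$coordinateUncertaintyInMeters)
sum(is.na(res_data$coordinateUncertaintyInMeters))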

What we find is considerable variation in the confidence of individual occurrence locations, so we probably want to remove records whose coordinateUncertaintyInMeters exceeds a threshold we deem acceptable. Let’s say 10 metres:

preci_data <- res_data[which(res_data$coordinateUncertaintyInMeters < 10), ]
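Be aware that which(... < 10) silently drops records whose coordinateUncertaintyInMeters is NA, because NA < 10 evaluates to NA. If you would rather retain records with unreported uncertainty, a hedged alternative could look like this:

# Sketch: keep records below the threshold OR without a reported uncertainty
keep_rows <- res_data$coordinateUncertaintyInMeters < 10 |
  is.na(res_data$coordinateUncertaintyInMeters)
preci_data_keepna <- res_data[keep_rows, ]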

This quality control leaves us with 6802 Calluna vulgaris records, a significant drop in data points which may well change our analyses and their outcomes drastically.

Extract a Subset of Data-Columns

GBIF-mediated data comes with a lot of attributes. These can be assessed readily via the Darwin Core documentation or, within R, via colnames(...) (here with ... = res_data). Rarely will we need all of them for our analyses. For now, we will simply subset the data to a smaller set of columns which are often relevant to end-users:

data_subset <- preci_data[
  ,
  c("scientificName", "decimalLongitude", "decimalLatitude", "basisOfRecord", "year", "month", "day", "eventDate", "countryCode", "municipality", "taxonKey", "species", "catalogNumber", "hasGeospatialIssues", "hasCoordinate", "datasetKey")
]
knitr::kable(head(data_subset))
| scientificName | decimalLongitude | decimalLatitude | basisOfRecord | year | month | day | eventDate | countryCode | municipality | taxonKey | species | catalogNumber | hasGeospatialIssues | hasCoordinate | datasetKey |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Calluna vulgaris (L.) Hull | 6.704158 | 62.66793 | HUMAN_OBSERVATION | 2012 | 9 | 28 | 2012-09-28 | NO |  | 2882482 | Calluna vulgaris | 30179 | FALSE | TRUE | 09c38deb-8674-446e-8be8-3347f6c094ef |
| Calluna vulgaris (L.) Hull | 6.868125 | 62.72794 | HUMAN_OBSERVATION | 2012 | 9 | 28 | 2012-09-28 | NO |  | 2882482 | Calluna vulgaris | 30076 | FALSE | TRUE | 09c38deb-8674-446e-8be8-3347f6c094ef |
| Calluna vulgaris (L.) Hull | 6.883944 | 62.74096 | HUMAN_OBSERVATION | 2012 | 9 | 28 | 2012-09-28 | NO |  | 2882482 | Calluna vulgaris | 29952 | FALSE | TRUE | 09c38deb-8674-446e-8be8-3347f6c094ef |
| Calluna vulgaris (L.) Hull | 6.614745 | 62.68688 | HUMAN_OBSERVATION | 2012 | 9 | 22 | 2012-09-22 | NO |  | 2882482 | Calluna vulgaris | 29905 | FALSE | TRUE | 09c38deb-8674-446e-8be8-3347f6c094ef |
| Calluna vulgaris (L.) Hull | 6.544501 | 62.65040 | HUMAN_OBSERVATION | 2012 | 9 | 22 | 2012-09-22 | NO |  | 2882482 | Calluna vulgaris | 29879 | FALSE | TRUE | 09c38deb-8674-446e-8be8-3347f6c094ef |
| Calluna vulgaris (L.) Hull | 6.548010 | 62.64857 | HUMAN_OBSERVATION | 2012 | 9 | 22 | 2012-09-22 | NO |  | 2882482 | Calluna vulgaris | 29814 | FALSE | TRUE | 09c38deb-8674-446e-8be8-3347f6c094ef |

Explore the Occurrence Data

Now that we have prepared the data we might use for our analyses, we can explore what it contains.

Data Contents

Here are a few overviews of Calluna vulgaris abundances across different data attributes:

table(data_subset$year)
## 
## 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 
##    2    5    1   11    4    8    8   12   37  313  509  380  255  255  349  404  505  502 
## 2018 2019 2020 2021 2022 
##  256  369  427 1022 1168
table(data_subset$stateProvince)
## < table of extent 0 >
table(data_subset$mediaType)
## < table of extent 0 >
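The last two tables come back empty because stateProvince and mediaType were not among the columns we retained when subsetting above. If you are interested in these fields, tabulate them from the full download instead; a quick sketch:

# Sketch: these fields are still present in the full download object
table(res_data$stateProvince)
table(res_data$mediaType)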

Spatial Data Handling

Most use-cases of GBIF-mediated data make use of the geolocation of records, either implicitly or explicitly. It is thus vital to be able to handle GBIF-mediated data for spatial analyses. There already exist plenty of workshops (like this one) on this topic, so I will be brief.

Make SpatialPointsDataFrame

First, we can use the sp package to create SpatialPoints from our geo-referenced occurrence data:

options(digits = 8) ## display up to 8 significant digits (i.e. total digits, not just decimal places) so coordinate decimals print in full
data_subset <- as.data.frame(data_subset)
data_subset$lon <- as.double(data_subset$decimalLongitude) ## cast lon from char to double
data_subset$lat <- as.double(data_subset$decimalLatitude) ## cast lat from char to double
coordinates(data_subset) <- ~ lon + lat ## make data_subset into SpatialPointsDataFrame
proj4string(data_subset) <- CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0") ## set CRS
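As an aside, the same spatial object can be built directly with the sf package (which we will use later on); a minimal sketch, starting from the pre-conversion preci_data:

# Sketch: sf-based construction of the same point data
data_sf_sketch <- sf::st_as_sf(
  as.data.frame(preci_data),
  coords = c("decimalLongitude", "decimalLatitude"),
  crs = 4326, # EPSG:4326, i.e. WGS84 longitude/latitude
  remove = FALSE # keep the coordinate columns as regular attributes
)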

The SpatialPointsDataFrame format lends itself well to analysing where occurrences have been recorded in relation to study parameters of choice (e.g., climatic conditions, land-use, political boundaries, etc.).
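As a flavour of such matching, extracting the values of an environmental raster at each occurrence location takes a single call; a sketch in which env_raster is a hypothetical placeholder for any raster layer sharing the points’ CRS:

# Sketch: env_raster is a placeholder for e.g. a bioclimatic RasterLayer
data_subset$env_value <- raster::extract(env_raster, data_subset)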

SpatialPoints and Polygons

As a first step, SpatialPoints can easily be used to create initial visualisations of spatial patterns:

## Data Handling
data_xy <- data_subset[c("decimalLongitude", "decimalLatitude")] ## Extract only the coordinates
data(wrld_simpl)
norway_mask <- subset(wrld_simpl, NAME == "Norway")
## Plotting
plot(norway_mask, axes = TRUE)
title("Calluna vulgaris presences recorded by human observation between 2000 and 2022 across Norway")
points(data_xy, col = "red", pch = 20, cex = 1) # plot species occurrence points to the map
legend("bottomright", title = "Legend", legend = "Occurrences", pch = 20, col = "red", cex = 0.9)

The Coordinate Reference System (CRS)

Each spatial object in R is assigned a Coordinate Reference System (CRS) which details how geolocational values are to be understood. For an overview of different CRSs, see here.

In R, we can assess the CRS of most spatial objects as follows:

raster::crs(wrld_simpl)
## Coordinate Reference System:
## Deprecated Proj.4 representation:
##  +proj=longlat +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +no_defs 
## WKT2 2019 representation:
## GEOGCRS["unknown",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ID["EPSG",6326]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433],
##         ID["EPSG",8901]],
##     CS[ellipsoidal,2],
##         AXIS["longitude",east,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433,
##                 ID["EPSG",9122]]],
##         AXIS["latitude",north,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433,
##                 ID["EPSG",9122]]]]
raster::crs(data_subset)
## Coordinate Reference System:
## Deprecated Proj.4 representation: +proj=longlat +datum=WGS84 +no_defs 
## WKT2 2019 representation:
## GEOGCRS["unknown",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]],
##         ID["EPSG",6326]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433],
##         ID["EPSG",8901]],
##     CS[ellipsoidal,2],
##         AXIS["longitude",east,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433,
##                 ID["EPSG",9122]]],
##         AXIS["latitude",north,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433,
##                 ID["EPSG",9122]]]]

When dealing with data from specific areas of the world, or when wanting to match occurrence data to other products with specific CRSs, it may be prudent to reproject the SpatialPoints occurrence data object. We can use sp::spTransform() to do so (note that the call below reprojects to the same CRS the data is already in):

sp::spTransform(data_subset, CRS("+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs +towgs84=0,0,0"))
## class       : SpatialPointsDataFrame 
## features    : 6802 
## extent      : 4.619497, 30.051781, 57.969404, 70.981838  (xmin, xmax, ymin, ymax)
## crs         : +proj=longlat +datum=WGS84 +no_defs 
## variables   : 16
## names       :             scientificName, decimalLongitude, decimalLatitude,     basisOfRecord, year, month, day,  eventDate, countryCode, municipality, taxonKey,          species, catalogNumber, hasGeospatialIssues, hasCoordinate, ... 
## min values  : Calluna vulgaris (L.) Hull,         4.619497,       57.969404, HUMAN_OBSERVATION, 2000,     1,   1,  961891200,          NO,             ,  2882482, Calluna vulgaris,              ,                   0,             1, ... 
## max values  : Calluna vulgaris (L.) Hull,        30.051781,       70.981838, HUMAN_OBSERVATION, 2022,    12,  31, 1671235200,          NO,         Voss,  2882482, Calluna vulgaris, OBS.259608743,                   0,             1, ...
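For an actual change of CRS, say to a projected system suited to Norway such as UTM zone 33N (EPSG:25833; an assumption here, pick whatever matches your other data products), the call is the same; a minimal sketch:

# Sketch: reproject the occurrence points to UTM zone 33N
data_utm33 <- sp::spTransform(data_subset, CRS(SRS_string = "EPSG:25833"))
raster::crs(data_utm33) # confirm the new CRS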

Classifying Spatial Data

Moving from the relatively limited sp package to the more feature-rich sf package enables more advanced operations on, and visualisations of, the spatial aspects of our data.

For example, consider we want to quantify abundances of Calluna vulgaris across political regions in Norway:

## Obtain sf object
data_sf <- st_as_sf(data_subset) # make sp into sf
NO_municip <- rnaturalearth::ne_states(country = "Norway", returnclass = "sf") # get shapefiles for Norwegian states
NO_municip <- sf::st_crop(NO_municip, extent(4.5, 31.5, 50, 71.5)) # crop shapefile to continental Norway
## Identify overlap of points and polygons
cover_sf <- st_intersects(NO_municip, data_sf)
names(cover_sf) <- NO_municip$name
## report abundances
abundances_municip <- unlist(lapply(cover_sf, length))
sort(abundances_municip, decreasing = TRUE)
##          Østfold  Møre og Romsdal         Buskerud       Vest-Agder         Telemark 
##             1838             1173              366              315              283 
##         Rogaland         Akershus        Hordaland       Aust-Agder    Sør-Trøndelag 
##              263              199              197              172              150 
##         Nordland          Hedmark Sogn og Fjordane             Oslo          Oppland 
##              129              127              115              106              104 
##         Vestfold   Nord-Trøndelag         Finnmark            Troms 
##               82               31               28               12

It looks like there are hotspots for Calluna vulgaris in Østfold and Møre og Romsdal. Could this be sampling bias, or an effect of bioclimatic niche preferences and local environmental conditions? Questions like these can be answered with additional analyses, which are beyond the scope of this workshop.

Let’s visualise these abundances:

NO_municip$abundances <- abundances_municip
ggplot(data = NO_municip) +
  geom_sf(aes(fill = abundances)) +
  scale_fill_viridis_c() +
  labs(x = "Longitude", y = "Latitude", col = "Abundance")

Finally, let’s identify, for each data record, which region it falls into and attach that information to the data itself. We can do so as follows (not necessarily in the most elegant way):

## create a dataframe of occurrence records by rownumber in original data (data_subset) and state-membership
cover_ls <- lapply(names(cover_sf), FUN = function(x) {
  data.frame(
    municip = x,
    points = cover_sf[[x]]
  )
})
cover_df <- do.call(rbind, cover_ls)
## attach state-membership to original data, NAs for points without state-membership
data_subset$municip <- NA
data_subset$municip[cover_df$points] <- cover_df$municip
## visualise the result
ggplot(data = NO_municip) +
  geom_sf(fill = "white") +
  geom_point(
    data = data_subset@data, size = 1,
    aes(x = decimalLongitude, y = decimalLatitude, col = municip)
  ) +
  scale_colour_viridis_d() +
  labs(x = "Longitude", y = "Latitude", col = "Municipality")

Let’s say we feed these data into an analysis which runs to completion and we want to report on our findings. What’s next? Now that you can handle GBIF-mediated data locally, you are ready to pipe these data into your specific analysis tools; all that remains is to cite the data you used.

Session Info

## R version 4.3.0 (2023-04-21)
## Platform: x86_64-apple-darwin20 (64-bit)
## Running under: macOS Ventura 13.3.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-x86_64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Oslo
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] rnaturalearth_0.3.2 raster_3.6-20       rgeos_0.6-2         maptools_1.1-6     
## [5] ggplot2_3.4.2       sf_1.0-12           sp_1.6-0            knitr_1.42         
## [9] rgbif_3.7.7        
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.3             xfun_0.39                bslib_0.4.2             
##  [4] lattice_0.21-8           vctrs_0.6.2              tools_4.3.0             
##  [7] generics_0.1.3           curl_5.0.0               tibble_3.2.1            
## [10] proxy_0.4-27             fansi_1.0.4              highr_0.10              
## [13] pkgconfig_2.0.3          R.oo_1.25.0              KernSmooth_2.23-20      
## [16] data.table_1.14.8        lifecycle_1.0.3          R.cache_0.16.0          
## [19] farver_2.1.1             compiler_4.3.0           stringr_1.5.0           
## [22] munsell_0.5.0            terra_1.7-29             codetools_0.2-19        
## [25] htmltools_0.5.5          class_7.3-21             sass_0.4.6              
## [28] yaml_2.3.7               lazyeval_0.2.2           pillar_1.9.0            
## [31] jquerylib_0.1.4          whisker_0.4.1            R.utils_2.12.2          
## [34] classInt_0.4-9           cachem_1.0.8             wk_0.7.2                
## [37] styler_1.9.1             tidyselect_1.2.0         digest_0.6.31           
## [40] stringi_1.7.12           dplyr_1.1.2              purrr_1.0.1             
## [43] bookdown_0.34            labeling_0.4.2           fastmap_1.1.1           
## [46] grid_4.3.0               colorspace_2.1-0         cli_3.6.1               
## [49] magrittr_2.0.3           triebeard_0.4.1          crul_1.4.0              
## [52] utf8_1.2.3               e1071_1.7-13             foreign_0.8-84          
## [55] withr_2.5.0              scales_1.2.1             bit64_4.0.5             
## [58] oai_0.4.0                rmarkdown_2.21           httr_1.4.5              
## [61] bit_4.0.5                blogdown_1.16            rnaturalearthhires_0.2.1
## [64] R.methodsS3_1.8.2        evaluate_0.20            rgdal_1.6-6             
## [67] viridisLite_0.4.2        s2_1.1.3                 urltools_1.7.3          
## [70] rlang_1.1.1              Rcpp_1.0.10              httpcode_0.3.0          
## [73] glue_1.6.2               DBI_1.1.3                xml2_1.3.4              
## [76] rstudioapi_0.14          jsonlite_1.8.4           R6_2.5.1                
## [79] plyr_1.8.8               units_0.8-2