Research Project

Our Research Project

Here (and over the next few exercises in this “course”), we are looking at a big (and entirely fictional) database of the common house sparrow (Passer domesticus). In particular, we are interested in the Evolution of Passer domesticus in Response to Climate Change.

The Data

I have created a large data set for this exercise which is available in a cleaned and properly handled version here.

Reading the Data into R

Let’s start by reading the data into R and taking an initial look at it:

Sparrows_df <- readRDS(file.path("Data", "SparrowData.rds"))
Sparrows_df <- Sparrows_df[!is.na(Sparrows_df$Weight), ] # drop records without a weight measurement
head(Sparrows_df)
##   Index Latitude Longitude     Climate Population.Status Weight Height Wing.Chord Colour    Sex Nesting.Site Nesting.Height Number.of.Eggs Egg.Weight Flock Home.Range Predator.Presence Predator.Type
## 1    SI       60       100 Continental            Native  34.05  12.87       6.67  Brown   Male         <NA>             NA             NA         NA     B      Large               Yes         Avian
## 2    SI       60       100 Continental            Native  34.86  13.68       6.79   Grey   Male         <NA>             NA             NA         NA     B      Large               Yes         Avian
## 3    SI       60       100 Continental            Native  32.34  12.66       6.64  Black Female        Shrub          35.60              1       3.21     C      Large               Yes         Avian
## 4    SI       60       100 Continental            Native  34.78  15.09       7.00  Brown Female        Shrub          47.75              0         NA     E      Large               Yes         Avian
## 5    SI       60       100 Continental            Native  35.01  13.82       6.81   Grey   Male         <NA>             NA             NA         NA     B      Large               Yes         Avian
## 6    SI       60       100 Continental            Native  32.36  12.67       6.64  Brown Female        Shrub          32.47              1       3.17     E      Large               Yes         Avian


When building models or trying to explain anything about our data set, we need to consider all the different variables and the information contained therein. In this data set, we have access to:

  1. Index [Factor] - an abbreviation of Site records
  2. Latitude [Numeric] - an identifier of where specific sparrow measurements were taken
  3. Longitude [Numeric] - an identifier of where specific sparrow measurements were taken
  4. Climate [Factor] - local climate types that sparrows are subjected to (e.g. coastal, continental, and semi-coastal)
  5. Population.Status [Factor] - population status (e.g. native or introduced)
  6. Weight [Numeric] - sparrow weight [g]; Range: 13-40g
  7. Height [Numeric] - sparrow height/length [cm]; Range: 10-22cm
  8. Wing.Chord [Numeric] - wing length [cm]; Range: 6-10cm
  9. Colour [Factor] - main plumage colour (e.g. brown, grey, and black)
  10. Sex [Factor] - sparrow sex
  11. Nesting.Site [Factor] - nesting conditions, only recorded for females (e.g. tree or shrub)
  12. Nesting.Height [Numeric] - nest elevation above ground level, only recorded for females
  13. Number.of.Eggs [Numeric] - number of eggs per nest, only recorded for females
  14. Egg.Weight [Numeric] - mean weight of eggs per nest, only recorded for females
  15. Flock [Factor] - which flock at each location each sparrow belongs to
  16. Home.Range [Factor] - size of home range of each flock (e.g. Small, Medium, and Large)
  17. Predator.Presence [Factor] - if a predator is present at a station (e.g. No or Yes)
  18. Predator.Type [Factor] - what kind of predator is present (e.g. Avian, Non-Avian, or None)
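
A quick way to check a data frame against a variable list like the one above is to inspect the column classes and count the missing values per column. The sketch below uses a small made-up data frame (`Toy_df`, with hypothetical values and only four of the columns) so that it runs without the data file; with the real data, `Sparrows_df` would take its place.

```r
# Toy stand-in for Sparrows_df; all values are made up for illustration
Toy_df <- data.frame(
  Sex = factor(c("Male", "Female", "Female", "Male")),
  Weight = c(34.05, 32.34, 34.78, 35.01),
  Nesting.Site = factor(c(NA, "Shrub", "Shrub", NA)),
  Number.of.Eggs = c(NA, 1, 0, NA)
)

# Class of every column (factor vs. numeric)
Col_classes <- sapply(Toy_df, class)
print(Col_classes)

# Number of missing values per column
NA_counts <- colSums(is.na(Toy_df))
print(NA_counts)

# Nesting variables should only be recorded for females,
# so they should all be NA for males
Males_df <- Toy_df[Toy_df$Sex == "Male", ]
Nesting_only_females <- all(is.na(Males_df$Nesting.Site)) &&
  all(is.na(Males_df$Number.of.Eggs))
print(Nesting_only_females)
```

A check like this is a cheap safeguard: if a nesting variable turned out to be recorded for a male, the data description and the data would disagree.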

Note that the variables Longitude and Latitude may be used to retrieve climate data variables from a host of data sources.


Looking at our data, we notice that it comes from distinct stations. Let’s visualise where they are:

library(leaflet)
# keep one row per station
Plot_df <- Sparrows_df[, c("Longitude", "Latitude", "Index", "Climate", "Population.Status")]
Plot_df <- unique(Plot_df)
m <- leaflet()
m <- addTiles(m) # add the default base map tiles
m <- addMarkers(m,
  lng = Plot_df$Longitude,
  lat = Plot_df$Latitude,
  label = Plot_df$Index,
  popup = paste(Plot_df$Population.Status, Plot_df$Climate, sep = ";")
)
m # render the map

Note that you can zoom and drag the above map as well as click the station markers for some additional information.

Adding Information

How do we get climate data for our stations? Well, I wrote an R package (KrigR) that does exactly that.

First, the package needs to be installed from its GitHub repository. Next, we need to set the API key and user number obtained from the Climate Data Store. I have already baked these into my material, so I don’t set them here, but I include lines of code that ask you for your credentials when copied and pasted:

if ("KrigR" %in% rownames(installed.packages()) == FALSE) { # KrigR check
  # NOTE: the install call is missing from this page; the repository name below is an assumption
  devtools::install_github("ErikKusch/KrigR")
} # end of KrigR check
library(KrigR)
#### CDS API (needed for ERA5-Land downloads)
if (!exists("API_Key") || !exists("API_User")) { # CDS API check: if CDS API credentials have not been specified elsewhere
  API_User <- readline(prompt = "Please enter your Climate Data Store API user number and hit ENTER.")
  API_Key <- readline(prompt = "Please enter your Climate Data Store API key number and hit ENTER.")
} # end of CDS API check

if (!exists("numberOfCores")) { # Core check: if number of cores for parallel processing has not been set yet
  # readline() returns a character string, so convert the answer to an integer
  numberOfCores <- as.integer(readline(prompt = paste("How many cores do you want to allocate to these processes? Your machine has", parallel::detectCores())))
} # end of Core check

Now that we have the package, we can download some state-of-the-art climate data. I have already prepared all of this in the data directory you downloaded earlier so this step will automatically be skipped:

if (!file.exists(file.path("Data", "SparrowDataClimate.rds"))) {
  colnames(Plot_df)[1:3] <- c("Lon", "Lat", "ID") # set column names to be in line with what KrigR wants
  Points_Raw <- download_ERA(
    Variable = "2m_temperature",
    DataSet = "era5",
    DateStart = "1982-01-01",
    DateStop = "2012-12-31",
    TResolution = "month",
    TStep = 1,
    Extent = Plot_df, # the point data with Lon and Lat columns
    Buffer = 0.5, # a 0.5 degree buffer should be drawn around each point
    ID = "ID", # this is the column which holds point IDs
    API_User = API_User,
    API_Key = API_Key,
    Dir = file.path(getwd(), "Data"),
    FileName = ""
  ) # end of download_ERA call
  Points_mean <- calc(Points_Raw, fun = mean)
  Points_sd <- calc(Points_Raw, fun = sd)
  Sparrows_df$TAvg <- as.numeric(extract(x = Points_mean, y = Sparrows_df[, c("Longitude", "Latitude")], buffer = 0.3))
  Sparrows_df$TSD <- as.numeric(extract(x = Points_sd, y = Sparrows_df[, c("Longitude", "Latitude")], buffer = 0.3))
  saveRDS(Sparrows_df, file.path("Data", "SparrowDataClimate.rds"))
} else {
  Sparrows_df <- readRDS(file.path("Data", "SparrowDataClimate.rds"))
} # end of file check

We have now effectively added two more variables to the data set:

  1. TAvg [Numeric] - Average air temperature over the 1982-2012 period (roughly 30 years)
  2. TSD [Numeric] - Standard deviation of mean monthly air temperature over the same period
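
To make these two summaries concrete: for each station, TAvg is the mean and TSD the standard deviation across the stack of monthly temperature values. Below is a minimal base-R sketch with made-up numbers. The matrix `Monthly_temps` and the station names are hypothetical, only one year of months is used, and no download is involved.

```r
# Hypothetical monthly mean temperatures [K] for two stations over one year
Monthly_temps <- rbind(
  Station_A = c(270, 272, 276, 281, 286, 290, 292, 291, 287, 281, 275, 271),
  Station_B = c(288, 288, 289, 290, 291, 292, 293, 293, 292, 291, 289, 288)
)

# Mean across months: the analogue of TAvg
TAvg <- apply(Monthly_temps, 1, mean)
# Standard deviation across months: the analogue of TSD
TSD <- apply(Monthly_temps, 1, sd)

print(round(cbind(TAvg, TSD), 2))
```

In this toy example, Station_A is both colder on average and more variable than Station_B, which is the kind of contrast the two variables are meant to capture.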

Now we have the data set we will look at for the rest of the exercises in this seminar series. But how did we get here? Find the answer here.


Let’s consider the following two hypotheses for our exercises for this simulated research project:

  1. Sparrow Morphology is determined by:
    A. Climate Conditions with sparrows in stable, warm environments faring better than those in colder, less stable ones.
    B. Competition with sparrows in small flocks doing better than those in big flocks.
    C. Predation with sparrows under pressure of predation doing worse than those without.
  2. Sites accurately represent sparrow morphology. This may mean:
    A. Population status as inferred through morphology.
    B. Site index as inferred through morphology.
    C. Climate as inferred through morphology.

We will try to answer these over the next few sessions.


## R version 4.0.2 (2020-06-22)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## Matrix products: default
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252    LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
## [5] LC_TIME=English_United Kingdom.1252    
## attached base packages:
## [1] parallel  stats     graphics  grDevices utils     datasets  methods   base     
## other attached packages:
##  [1] KrigR_0.1.0       httr_1.4.2        stars_0.4-3       abind_1.4-5       fasterize_1.0.3   sf_0.9-6          lubridate_1.7.9   automap_1.0-14    doParallel_1.0.15 iterators_1.0.12 
## [11] foreach_1.5.0     rgdal_1.5-18      raster_3.3-13     sp_1.4-4          stringr_1.4.0     keyring_1.1.0     ecmwfr_1.3.0      ncdf4_1.17        leaflet_2.0.3    
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.5         lattice_0.20-41    FNN_1.1.3          class_7.3-17       zoo_1.8-8          assertthat_0.2.1   gstat_2.0-6        digest_0.6.27      R6_2.5.0           plyr_1.8.6        
## [11] backports_1.1.10   evaluate_0.14      e1071_1.7-3        blogdown_1.0.2     pillar_1.4.6       rlang_0.4.10       rstudioapi_0.11    R.utils_2.10.1     R.oo_1.24.0        rmarkdown_2.6     
## [21] styler_1.3.2       htmlwidgets_1.5.1  compiler_4.0.2     xfun_0.20          pkgconfig_2.0.3    htmltools_0.5.0    tidyselect_1.1.0   tibble_3.0.3       bookdown_0.21      intervals_0.15.2  
## [31] codetools_0.2-16   reshape_0.8.8      spacetime_1.2-3    crayon_1.3.4       dplyr_1.0.2        R.methodsS3_1.8.1  grid_4.0.2         lwgeom_0.2-5       jsonlite_1.7.2     lifecycle_0.2.0   
## [41] DBI_1.1.0          magrittr_2.0.1     units_0.6-7        KernSmooth_2.23-17 stringi_1.5.3      ellipsis_0.3.1     xts_0.12.1         vctrs_0.3.4        generics_0.0.2     rematch2_2.1.2    
## [51] tools_4.0.2        glue_1.4.2         R.cache_0.14.0     purrr_0.3.4        crosstalk_1.1.0.1  yaml_2.2.1         classInt_0.4-3     memoise_1.1.0      knitr_1.30