Correlation Tests

Theory

Welcome to our third practical experience in R. Throughout the following notes, I will introduce you to a couple statistical correlation approaches that might be useful to you and are, to varying degrees, often used in biology. To do so, I will enlist the sparrow data set we handled in our first exercise.

I have prepared some Lecture Slides for this session.

Data

Find the data for this exercise here.

Preparing Our Procedure

To ensure others can reproduce our analysis we run the following three lines of code at the beginning of our R coding file.

rm(list=ls()) # clearing environment
Dir.Base <- getwd() # soft-coding our working directory
Dir.Data <- paste(Dir.Base, "Data", sep="/") # soft-coding our data directory 

Packages

Using the following, user-defined function, we install/load all the necessary packages into our current R session.

# function to load packages and install them if they haven't been installed yet
install.load.package <- function(x) {
  if (!require(x, character.only = TRUE))
    install.packages(x)
  require(x, character.only = TRUE)
}
package_vec <- c("DescTools", # Needed for Contingency Coefficient
                 "ggplot2" # needed for data visualisation
                 )
sapply(package_vec, install.load.package)
## Loading required package: DescTools
## Loading required package: ggplot2
## DescTools   ggplot2 
##      TRUE      TRUE

Loading Data

During our first exercise (Data Mining and Data Handling - Fixing The Sparrow Data Set) we saved our clean data set as an RDS file. To load this, we use the readRDS() command that comes with base R.

Data_df_base <- readRDS(file = paste(Dir.Data, "/1 - Sparrow_Data_READY.rds", sep=""))
Data_df <- Data_df_base # duplicate and save initial data on a new object

Nominal Scale - Contingency Coefficient

We can analyse correlations/dependencies of variables of the categorical kind using the contingency coefficient by calling the ContCoef() function of base R.

Keep in mind that the contingency coefficient is not really a measure of correlation but merely of association of variables. A value of \(c \approx 0\) indicates independent variables.

Predation

Are colour morphs of Passer domesticus linked to predator presence and/or predator type?

This analysis builds on our findings within our previous exercise (Nominal Tests - Analysing The Sparrow Data Set). Remember that, using the two-sample situation Chi-Squared Test, we found no change in treatment effects (as far as colour polymorphism went) for predator type values but did so regarding the presence of predators. Let’s repeat this here:

table(Data_df$Colour, Data_df$Predator.Presence)
##        
##          No Yes
##   Black  64 292
##   Brown 211  87
##   Grey   82 331
ContCoef(table(Data_df$Colour, Data_df$Predator.Presence))
## [1] 0.4421914
table(Data_df$Colour, Data_df$Predator.Type)
##        
##         Avian Non-Avian
##   Black   197        95
##   Brown    60        27
##   Grey    233        98
ContCoef(table(Data_df$Colour, Data_df$Predator.Type))
## [1] 0.02957682

Here, we find the same results as when using the Chi-Squared statistic and conclude that colour morphs of the common house sparrow are likely to be driven by predator presence but not the type of predator that is present.

Are nesting sites of Passer domesticus linked to predator presence and/or predator type?

Again, following our two-sample situation Chi-Squared analysis from last exercise, we want to test whether nesting site and predator presence/predator type are linked. The Chi-Squared analyses identified a possible link of nesting site and predator type but nor predator presence.

table(Data_df$Nesting.Site, Data_df$Predator.Presence)
##        
##          No Yes
##   Shrub  87 205
##   Tree   94 137
ContCoef(table(Data_df$Nesting.Site, Data_df$Predator.Presence))
## [1] 0.1130328
table(Data_df$Nesting.Site, Data_df$Predator.Type)
##        
##         Avian Non-Avian
##   Shrub   182        23
##   Tree     49        88
ContCoef(table(Data_df$Nesting.Site, Data_df$Predator.Type))
## [1] 0.4851588

Whilst there doesn’t seem to be any strong evidence linking nesting site and predator presence, predator type seems to be linked to what kind of nesting site Passer domesticus prefers thus supporting our Chi-Squared results.

Sexual Dimorphism

Are sex ratios of Passer domesticus related to climate types?

Recall that, in our last exercise, we found no discrepancy of proportions of the sexes among the entire data set using a binomial test. What we didn’t check yet was whether the sexes are distributed across the sites somewhat homogeneously or whether the sex ratios might be skewed according to climate types. Let’s do this:

# prepare climate type testing data
Data_df <- Data_df_base
Index <- Data_df$Index
# select all data belonging to the stations at which all parameters except for climate type are held constant
Rows <- which(Index == "SI" | Index == "UK" | Index == "RE" | Index == "AU" )
Data_df <- Data_df[Rows,]
Data_df$Climate <- droplevels(factor(Data_df$Climate))

# analysis
table(Data_df$Sex, Data_df$Climate)
##         
##          Coastal Continental
##   Female      91          76
##   Male        72          78
ContCoef(table(Data_df$Sex, Data_df$Climate))
## [1] 0.06470702

Quite obviously, they aren’t and, if there are any patterns in sex ratios to emerge, these are not likely to stem from climate types. Also keep in mind that we have a plethora of other variables at play whilst the information contained within the climate type variable is somewhat constrained and, in this case, bordering on uninformative (i.e. a coastal site in the Arctic might be more closely resembled by a continental mid-latitude site than by a tropical coastal site).

Ordinal Scale - Kendall’s Tau

Climate Warming/Extremes

Do heavier sparrows have heavier/less eggs?

A heavier weight of individual females alludes to a higher pool of resources being allocated by said individuals. There are multiple ways they might make use of it, one of them being investment in reproduction. To test how heavier females of Passer domesticus utilise their resources in reproduction, we use a Kendall’s Tau approach to finding links between female weight and average egg weight per nest/number of eggs per nest.
Obviously, both weight variables are metric in nature and so we could use other methods as well. On top of that, we first need to convert these into ranks before being able to run a Kendall’s Tau analysis as follows:

# overwriting changes in Data_df with base data
Data_df <- Data_df_base
# Establishing Ranks of Egg Weight
RankedEggWeight <- rank(Data_df$Egg.Weight[which(Data_df$Sex == "Female")],
                        ties.method = "average")
# Establishing Ranks of Female Weight
RankedWeight_Female <- rank(Data_df$Weight[which(Data_df$Sex == "Female")],
                            ties.method = "average")
# Extracting Numbers of Eggs
RankedEggs <- Data_df$Number.of.Eggs[which(Data_df$Sex == "Female")]

Luckily enough, the number of eggs per nest already represent a ranked (ordinal) variable and so we can move straight on to running our analyses:

# Test ranked female weight vs. ranked egg weight
cor.test(x = RankedWeight_Female, y = RankedEggWeight, 
         use = "pairwise.complete.obs", method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  RankedWeight_Female and RankedEggWeight
## z = 19.771, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##       tau 
## 0.5804546

There is strong evidence to suggest that heavier females tend to lay heavier eggs (tau = 0.5804546 at p \(\approx\) 0).

# Test ranked female weight vs. number of eggs
cor.test(x = RankedWeight_Female, y = RankedEggs, 
         use = "pairwise.complete.obs", method = "kendall")
## 
##  Kendall's rank correlation tau
## 
## data:  RankedWeight_Female and RankedEggs
## z = -21.787, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
##        tau 
## -0.6880932

Additionally, the heavier a female of Passer domesticus, the less eggs she produces (tau = -0.6880932 at p \(\approx\) 0).

Now we can visualise the underlying patterns:

plot_df <- data.frame(RankedWeight = RankedWeight_Female,
                      RankedEggWeight = RankedEggWeight,
                      RankedEggNumber = RankedEggs)
# plot ranked female weight vs. ranked egg weight
ggplot(data = plot_df, aes(x = RankedWeight, y = RankedEggWeight)) +
  geom_point() + theme_bw() + stat_smooth(method = "lm") + 
  labs(title = "Ranked female weight of Passer domesticus vs. Ranked weigt of eggs")
## `geom_smooth()` using formula 'y ~ x'

# plot ranked female weight vs. number of eggs
ggplot(data = plot_df, aes(x = RankedWeight, y = RankedEggNumber)) +
  geom_point() + theme_bw() + stat_smooth(method = "lm") + 
  labs(title = "Ranked female weight of Passer domesticus vs. Ranked number of eggs")
## `geom_smooth()` using formula 'y ~ x'

This highlights an obvious and intuitive trade-off in nature and has us reject the null hypotheses.

Metric and Ordinal Scales

Metric scale correlation analyses call for:

  • Spearman correlation test (non-parametric)
  • Pearson correlation test (parametric, requires data to be normal distributed)

Testing for Normality

DISCLAIMER: I do not expect you to do it this way as of right now but wanted to give you reference material of how to automate this testing step.

Since most of our following analyses are focussing on latitude effects (i.e. “climate warming”), we need to alter our base data. Before we can run the analyses, we need to eliminate the sites that we have set aside for testing climate extreme effects on (Siberia, United Kingdom, Reunion and Australia) from the data set. We are also not interested in all variables and so reduce the data set further.

Data_df <- Data_df_base
Index <- Data_df$Index
Rows <- which(Index != "SI" & Index != "UK" & Index != "RE" & Index != "AU" )
Data_df <- Data_df[Rows,c("Index", "Latitude", "Weight", "Height", "Wing.Chord", "Nesting.Height", 
                             "Number.of.Eggs", "Egg.Weight")]

In order to know which test we can use with which variable, we need to first identify whether our data is normal distributed using the Shapiro-Test (Seminar 3) as follows:

Normal_df <- data.frame(Normality = as.character(), stringsAsFactors = FALSE)
for(i in 2:length(colnames(Data_df))){
  Normal_df[1,i] <- colnames(Data_df)[i]
  Normal_df[2,i] <- round(shapiro.test(as.numeric(Data_df[,i]))$p.value, 2)
}
colnames(Normal_df) <- c()
rownames(Normal_df) <- c("Variable", "p")
Normal_df <- Normal_df[,-1] # removing superfluous Index column
Normal_df
##                                                                         
## Variable Latitude Weight Height Wing.Chord Nesting.Height Number.of.Eggs
## p               0      0      0          0              0              0
##                    
## Variable Egg.Weight
## p                 0

Unfortunately, none of these variables seem to be normal distributed (this was to be expected for some, to be fair) thus barring us from using Pearson correlation on the entire data set leaving us with the Spearman correlation method.

Spearman

Climate Warming/Extremes

Height/Length

Do height/length records of Passer domesticus and latitude correlate?

Following Bergmann’s rule, we expect a positive correlation between sparrow height/length and absolute values of latitude:

cor.test(x = abs(Data_df$Latitude), y = Data_df$Height, 
         use = "pairwise.complete.obs", method = "spearman")
## Warning in cor.test.default(x = abs(Data_df$Latitude), y = Data_df$Height, :
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  abs(Data_df$Latitude) and Data_df$Height
## S = 128104735, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.8219373
ggplot(data = Data_df, aes(x = abs(Latitude), y = Height)) +
  geom_point() + theme_bw() + stat_smooth(method = "lm") + 
  labs(title = "Height of Passer domesticus vs. Latitude")
## `geom_smooth()` using formula 'y ~ x'

Interestingly enough, our analysis yields a negative correlation which would disproof Bermann’s rule. This is a good example to show how important biological background knowledge is when doing biostatistics. Whilst a pure statistician might now believe to have just dis-proven a big rule of biology, it should be apparent to any biologist that Bergmann spoke of “bigger” organisms in colder climates (higher latitudes) and not of “taller” individuals. What our sparrows lack in height, they might make up for in circumference. This is an example where we would reject the null hypothesis but shouldn’t accept the alternative hypothesis based on biological understanding.

Weight

Do weight records of Passer domesticus and latitude correlate?

Again, following Bergmann’s rule, we expect a positive correlation between sparrow weight and absolute values of latitude.

cor.test(x = abs(Data_df$Latitude), y = Data_df$Weight, 
         use = "pairwise.complete.obs", method = "spearman")
## Warning in cor.test.default(x = abs(Data_df$Latitude), y = Data_df$Weight, :
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  abs(Data_df$Latitude) and Data_df$Weight
## S = 9349037, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.8670357
ggplot(data = Data_df, aes(x = abs(Latitude), y = Weight)) +
  geom_point() + theme_bw() + stat_smooth(method = "lm") + 
  labs(title = "Weight of Passer domesticus vs. Latitude")
## `geom_smooth()` using formula 'y ~ x'

Bergmann was obviously right and we reject the null hypothesis.

Wing Chord

Do wing chord/wing span records of Passer domesticus and latitude correlate?

We would expect sparrows in higher latitudes (e.g. colder climates) to have smaller wings as to radiate less body heat.

cor.test(x = abs(Data_df$Latitude), y = Data_df$Wing.Chord, 
         use = "pairwise.complete.obs", method = "spearman")
## Warning in cor.test.default(x = abs(Data_df$Latitude), y = Data_df$Wing.Chord, :
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  abs(Data_df$Latitude) and Data_df$Wing.Chord
## S = 134573929, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##        rho 
## -0.9139437
ggplot(data = Data_df, aes(x = abs(Latitude), y = Wing.Chord)) +
  geom_point() + theme_bw() + stat_smooth(method = "lm") + 
  labs(title = "Wing chord of Passer domesticus vs. Latitude")
## `geom_smooth()` using formula 'y ~ x'

And we were right! Sparrows have shorter wingspans in higher latitudes and we reject the null hypothesis.

Number of Eggs

Do numbers of eggs per nest of Passer domesticus and latitude correlate?

Due to resource constraints in colder climates, we expect female Passer domesticus individuals to invest in quality over quantity by prioritising caring your fledglings by educing the amount of eggs they produce.

cor.test(x = abs(Data_df$Latitude), y = Data_df$Number.of.Eggs, 
         use = "pairwise.complete.obs", method = "spearman")
## Warning in cor.test.default(x = abs(Data_df$Latitude), y =
## Data_df$Number.of.Eggs, : Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  abs(Data_df$Latitude) and Data_df$Number.of.Eggs
## S = 14501794, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##      rho 
## -0.92853
ggplot(data = Data_df, aes(x = abs(Latitude), y = Number.of.Eggs)) +
  geom_point() + theme_bw() + stat_smooth(method = "lm") + 
  labs(title = "Number of eggs of Passer domesticus nests vs. Latitude")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 394 rows containing non-finite values (stat_smooth).
## Warning: Removed 394 rows containing missing values (geom_point).

We were right. Female house sparrows produce less eggs per capita in higher latitudes and we reject the null hypothesis.

Egg Weight

Does average weight of eggs per nest of Passer domesticus and latitude correlate?

Due to the reduced investment in egg numbers that we have just proven, we expect females of Passer domesticus to allocate some of their saved resources into heavier eggs which may nurture unhatched offspring for longer and more effectively.

cor.test(x = abs(Data_df$Latitude), y = Data_df$Egg.Weight, 
         use = "pairwise.complete.obs", method = "spearman")
## Warning in cor.test.default(x = abs(Data_df$Latitude), y = Data_df$Egg.Weight, :
## Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  abs(Data_df$Latitude) and Data_df$Egg.Weight
## S = 1318221, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##     rho 
## 0.82321
ggplot(data = Data_df, aes(x = abs(Latitude), y = Egg.Weight)) +
  geom_point() + theme_bw() + stat_smooth(method = "lm") + 
  labs(title = "Weight of Passer domesticus eggs vs. Latitude")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 395 rows containing non-finite values (stat_smooth).
## Warning: Removed 395 rows containing missing values (geom_point).

Indeed, the higher the latitude, the heavier the average egg per nest of Passer domesticus and we reject the null hypothesis.

Pearson

We already know that we can’t analyse the entire data set at once for these two variables since their data values are not normal distirbuted. How about site-wise variable value distributions, though?

# Further reducing the data set
Data_df <- Data_df[,c("Index", "Weight", "Height")]
# establishing an empty data frame and an index vector that doesn't repeat
Normal_df <- data.frame(Normality = as.character(), stringsAsFactors = FALSE)
Indices <- as.character(unique(Data_df_base$Index[Rows]))
# site-wise shapiro test
for(i in 2:length(colnames(Data_df))){ # variables
  for(k in 1:length(Indices)){ # sites
    Normal_df[i,k] <- round(shapiro.test(as.numeric(
          Data_df[,i][which(Data_df_base$Index[Rows] == Indices[k])])
        )$p.value, 2)}} # site loop, variable loop
rownames(Normal_df) <- colnames(Data_df)
colnames(Normal_df) <- Indices
Normal_df[-1,] # remove superfluous index row
##          NU   MA   LO   BE   FG   SA   FI
## Weight 0.57 0.12 0.50 0.38 0.18 0.76 0.43
## Height 0.23 0.03 0.19 0.59 0.88 0.27 0.86

According to these results intra-site correlations of sparrow weight and height can be carried out! So, in order to show Pearson correlation, we run a simple, site-wise correlation analysis.

Do weight and height records of Passer domesticus correlate within each site?

To shed some light on our previous findings, we might want to see whether weight and height of sparrows correlate. Without running the analysis, we can conclude that they do because both correlate with latitude and are thus what we call collinear. However, now we are running the analysis on a site level - does the correlation still exist?

Take note that we are now using our entire data set again.

# overwriting altered Data_df
Data_df <- Data_df_base
# establishing an empty data frame and an index vector that doesn't repeat
Pearson_df <- data.frame(Pearson = as.character(), stringsAsFactors = FALSE)
Indices <- as.character(unique(Data_df$Index))
# site-internal correlation tests, weight and height
for(i in 1:length(unique(Data_df$Index))){
  Weights <- Data_df$Weight[which(Data_df$Index == Indices[i])]
  Heights <- Data_df$Height[which(Data_df$Index == Indices[i])]
  Pearson_df[1,i] <- round(cor.test(x = Weights, y = Heights, 
                                    use = "pairwise.complete.obs")[["estimate"]][["cor"]], 2)
  Pearson_df[2,i] <- round(cor.test(x = Weights, y = Heights, 
                                    use = "pairwise.complete.obs")$p.value, 2)}
colnames(Pearson_df) <- Indices
rownames(Pearson_df) <- c("r", "p")
Pearson_df
##     SI   UK   AU   RE   NU   MA   LO   BE   FG   SA   FI
## r 0.76 0.83 0.77 0.75 0.84 0.84 0.79 0.82 0.79 0.76 0.81
## p    0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

Apparently, it does. Heavier birds are taller!

Previous
Next