# Regressions

# Theory

These are exercises and solutions meant as a compendium to my talk on Model Selection and Model Building.

I have prepared some Lecture Slides for this session.

# Our Resarch Project

Today, we are looking at a big (and entirely fictional) data base of the common house sparrow (*Passer domesticus*). In particular, we are interested in the **Evolution of Passer domesticus in Response to Climate Change** which was previously explained here.

## The Data

I have created a large data set for this exercise which is available here and we previously cleaned up so that is now usable here.

## Reading the Data into `R`

Let’s start by reading the data into `R`

and taking an initial look at it:

```
Sparrows_df <- readRDS(file.path("Data", "SparrowDataClimate.rds"))
head(Sparrows_df)
```

```
## Index Latitude Longitude Climate Population.Status Weight Height Wing.Chord Colour Sex Nesting.Site Nesting.Height Number.of.Eggs Egg.Weight Flock Home.Range Predator.Presence Predator.Type
## 1 SI 60 100 Continental Native 34.05 12.87 6.67 Brown Male <NA> NA NA NA B Large Yes Avian
## 2 SI 60 100 Continental Native 34.86 13.68 6.79 Grey Male <NA> NA NA NA B Large Yes Avian
## 3 SI 60 100 Continental Native 32.34 12.66 6.64 Black Female Shrub 35.60 1 3.21 C Large Yes Avian
## 4 SI 60 100 Continental Native 34.78 15.09 7.00 Brown Female Shrub 47.75 0 NA E Large Yes Avian
## 5 SI 60 100 Continental Native 35.01 13.82 6.81 Grey Male <NA> NA NA NA B Large Yes Avian
## 6 SI 60 100 Continental Native 32.36 12.67 6.64 Brown Female Shrub 32.47 1 3.17 E Large Yes Avian
## TAvg TSD
## 1 269.9596 15.71819
## 2 269.9596 15.71819
## 3 269.9596 15.71819
## 4 269.9596 15.71819
## 5 269.9596 15.71819
## 6 269.9596 15.71819
```

## Hypotheses

Let’s remember our hypotheses:

**Sparrow Morphology**is determined by:

A.*Climate Conditions*with sparrows in stable, warm environments fairing better than those in colder, less stable ones.

B.*Competition*with sparrows in small flocks doing better than those in big flocks.

C.*Predation*with sparrows under pressure of predation doing worse than those without.

**Sites**accurately represent**sparrow morphology**. This may mean:

A.*Population status*as inferred through morphology.

B.*Site index*as inferred through morphology.

C.*Climate*as inferred through morphology.

Quite obviously, **hypothesis 1** is the only one lending itself well to regression exercises. Since we have three variables that describe sparrow morphology (`Weight`

, `Height`

, `Wing.Chord`

) and multi-response-variable models are definitely above the pay-grade of this material, we need to select one of our morphology variables as our response variable here. Remembering the Classification exercise, we recall that `Weight`

was the most informative morphological trait so far. Hence, we stick with this one for these exercises.

`R`

Environment

For this exercise, we will need the following packages:

```
install.load.package <- function(x) {
if (!require(x, character.only = TRUE)) {
install.packages(x, repos = "http://cran.us.r-project.org")
}
require(x, character.only = TRUE)
}
package_vec <- c(
"ggplot2", # for visualisation
"nlme", # for mixed effect models
"HLMdiag" # for leverage of mixed effect models
)
sapply(package_vec, install.load.package)
```

```
## ggplot2 nlme HLMdiag
## TRUE TRUE TRUE
```

Using the above function is way more sophisticated than the usual `install.packages()`

& `library()`

approach since it automatically detects which packages require installing and only install these thus not overwriting already installed packages.

# Linear Regression

Remember the **Assumptions of Linear Regression**:

- Variable values follow homoscedasticity (equal variance across entire data range)

- Residuals follow normal distribution (normality)

- Absence of influential outliers

- Response and Predictor are related in a linear fashion

## Climate Conditions

`Weight`

as a result of average temperature (`TAvg`

)

Before we begin, let’s plot the data we want to model:

```
ggplot(data = Sparrows_df, aes(y = Weight, x = TAvg)) +
stat_smooth(method = "lm") +
geom_point() +
theme_bw()
```

I have an inkling that we might run into some issues here, but let’s continue for now:

Let’s build the actual model:

```
H1_ClimateTavg <- lm(Weight ~ TAvg, data = Sparrows_df)
par(mfrow = c(2, 2))
plot(H1_ClimateTavg)
```

While the meeting of most of the assumptions here might be debatable, we certainly cannot accept a linear model with residuals this non-normal distributed. Sow hat do we? We try to remove as many confounding effects as possible!

Remember the plot-locations, climates, and population statues in the data set (go back here in necessary). How about we look exclusively at stations in the Americas which are all housing introduced sparrow populations and almost exclusively lie in coastal habitats? I’ll remove all non-coastal climate sites and only consider the central and North America here. Let’s do that:

```
# everything west of -7° is on the Americas
# every that's coastal climate type is retained
# everything north of 11° is central and north America
CentralNorthAm_df <- Sparrows_df[Sparrows_df$Longitude < -7 & Sparrows_df$Climate == "Coastal" & Sparrows_df$Latitude > 11, ]
ggplot(data = CentralNorthAm_df, aes(y = Weight, x = TAvg)) +
stat_smooth(method = "lm") +
geom_point() +
theme_bw()
```

```
H1_ClimateTavg <- lm(Weight ~ TAvg, data = CentralNorthAm_df)
par(mfrow = c(2, 2))
plot(H1_ClimateTavg)
```

This looks sensible to me! The scatterplot shows that sparrows are lighter in warmer areas which makes sense to me.

`summary(H1_ClimateTavg)`

```
##
## Call:
## lm(formula = Weight ~ TAvg, data = CentralNorthAm_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1440 -1.0861 0.0219 0.9823 3.9743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.167990 1.618752 27.903 < 2e-16 ***
## TAvg -0.049501 0.005635 -8.785 2.63e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.404 on 248 degrees of freedom
## Multiple R-squared: 0.2373, Adjusted R-squared: 0.2343
## F-statistic: 77.18 on 1 and 248 DF, p-value: 2.633e-16
```

Our model estimates show the same pattern.

`Weight`

as a result of temperature variability (`TSD`

)

I’ll continue with my North American, coastal subset here:

```
# everything west of -7° is on the Americas
# every that's coastal climate type is retained
# everything north of 11° is central and north America
CentralNorthAm_df <- Sparrows_df[Sparrows_df$Longitude < -7 & Sparrows_df$Climate == "Coastal" & Sparrows_df$Latitude > 11, ]
ggplot(data = CentralNorthAm_df, aes(y = Weight, x = TSD)) +
stat_smooth(method = "lm") +
geom_point() +
theme_bw()
```

```
H1_ClimateTSD <- lm(Weight ~ TSD, data = CentralNorthAm_df)
par(mfrow = c(2, 2))
plot(H1_ClimateTSD)
```

`summary(H1_ClimateTSD)`

```
##
## Call:
## lm(formula = Weight ~ TSD, data = CentralNorthAm_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.7978 -1.0053 0.0002 1.0042 3.5648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.61266 0.14922 198.45 <2e-16 ***
## TSD 0.22680 0.02069 10.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.319 on 248 degrees of freedom
## Multiple R-squared: 0.3263, Adjusted R-squared: 0.3236
## F-statistic: 120.1 on 1 and 248 DF, p-value: < 2.2e-16
```

Again all assumptions are met and the model itself is very intuitive: The more variable the climate, the heavier the sparrows.

`Weight`

as a result of both temperature mean (`TAvg`

) and temperature variability (`TSD`

)

Naturally, we continue with the same subset of the data as before:

```
# everything west of -7° is on the Americas
# every that's coastal climate type is retained
# everything north of 11° is central and north America
CentralNorthAm_df <- Sparrows_df[Sparrows_df$Longitude < -7 & Sparrows_df$Climate == "Coastal" & Sparrows_df$Latitude > 11, ]
H1_ClimateCont <- lm(Weight ~ TAvg + TSD, data = CentralNorthAm_df)
par(mfrow = c(2, 2))
plot(H1_ClimateCont)
```

`summary(H1_ClimateCont)`

```
##
## Call:
## lm(formula = Weight ~ TAvg + TSD, data = CentralNorthAm_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6593 -1.0263 0.0272 0.9207 3.2359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.59866 4.68943 3.540 0.000479 ***
## TAvg 0.04215 0.01518 2.777 0.005915 **
## TSD 0.38142 0.05931 6.431 6.52e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.302 on 247 degrees of freedom
## Multiple R-squared: 0.3467, Adjusted R-squared: 0.3414
## F-statistic: 65.54 on 2 and 247 DF, p-value: < 2.2e-16
```

Interestingly, average temperature has a different (and weaker) effect now than when investigated in isolation. Variability of temperature has become even more important (i.e. stronger effect). So which model should we use? That’s exactly what we’ll investigate in our session on Model Selection and Validation!

## Competition

Next, we build models which aim to explain sparrow `Weight`

through variables pertaining to competition. To do so, I first calculate the size for each flock at each site and append these to each bird:

```
FlockSizes <- with(Sparrows_df, table(Flock, Index))
Sparrows_df$Flock.Size <- NA
for (Flock_Iter in rownames(FlockSizes)) { # loop over flocks
for (Site_Iter in colnames(FlockSizes)) { # loop over sites
Positions <- Sparrows_df$Index == Site_Iter & Sparrows_df$Flock == Flock_Iter
Sparrows_df$Flock.Size[Positions] <- FlockSizes[Flock_Iter, Site_Iter]
}
}
```

With that done, we can now build our models up again like we did with the climate data before.

`Weight`

as a result of Home Range size (`Home.Range`

)

```
ggplot(data = Sparrows_df, aes(x = Home.Range, y = Weight)) +
geom_boxplot() +
theme_bw()
```

```
H1_CompetitionHR <- lm(Weight ~ Home.Range, data = Sparrows_df)
par(mfrow = c(2, 2))
plot(H1_CompetitionHR)
```

Nope. Those residuals have me nope out. We won’t use this model.

`Weight`

as a result of Flock Size size (`Flock.Size`

)

```
ggplot(data = Sparrows_df, aes(x = Flock.Size, y = Weight)) +
stat_smooth(method = "lm") +
geom_point() +
theme_bw()
```

```
H1_CompetitionFS <- lm(Weight ~ Flock.Size, data = Sparrows_df)
par(mfrow = c(2, 2))
plot(H1_CompetitionFS)
```

`summary(H1_CompetitionFS)`

```
##
## Call:
## lm(formula = Weight ~ Flock.Size, data = Sparrows_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7421 -1.2163 0.0961 1.2823 4.7176
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 35.456378 0.114500 309.7 <2e-16 ***
## Flock.Size -0.237908 0.003831 -62.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.893 on 1064 degrees of freedom
## Multiple R-squared: 0.7838, Adjusted R-squared: 0.7836
## F-statistic: 3857 on 1 and 1064 DF, p-value: < 2.2e-16
```

Now that’s a neat model! Not only do the diagnostics plot look spot-on (not showing these here), the relationship is strong and clear to see - the bigger the flock, the lighter the sparrow.

`Weight`

as a result of Home Range size (`Home.Range`

) and Flock Size size (`Flock.Size`

)

```
ggplot(data = Sparrows_df, aes(x = Flock.Size, y = Weight, col = Home.Range)) +
geom_point() +
stat_smooth(method = "lm") +
theme_bw()
```

```
H1_CompetitionFULL <- lm(Weight ~ Home.Range * Flock.Size, data = Sparrows_df)
par(mfrow = c(2, 2))
plot(H1_CompetitionFULL)
```

`summary(H1_CompetitionFULL)`

```
##
## Call:
## lm(formula = Weight ~ Home.Range * Flock.Size, data = Sparrows_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.5599 -1.2261 0.0427 1.2806 4.5553
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.29279 0.31986 104.086 < 2e-16 ***
## Home.RangeMedium 0.29700 0.89288 0.333 0.739
## Home.RangeSmall 1.88176 0.35265 5.336 1.16e-07 ***
## Flock.Size -0.07337 0.01848 -3.970 7.66e-05 ***
## Home.RangeMedium:Flock.Size -0.03363 0.05787 -0.581 0.561
## Home.RangeSmall:Flock.Size -0.16211 0.01896 -8.548 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.8 on 1060 degrees of freedom
## Multiple R-squared: 0.8051, Adjusted R-squared: 0.8042
## F-statistic: 876 on 5 and 1060 DF, p-value: < 2.2e-16
```

We see that our model is unsure of what to do with the home ranges when it comes to medium-sized ranges. The scatterplot shows that is is probably due to us not having a lot of samples for medium-sized home-ranges. Aside from that, our model meets all assumptions and produces quite intuitive parameter estimates.

## Predation

Next, we look at sparrow `Weight`

through the lens of predation. To do so, we need to recode all `NA`

s in the `Predator.Type`

variable into something else for our models ro tun properly. I chose `"None"`

here to indicate that there is no predation-pressure at these sites:

```
levels(Sparrows_df$Predator.Type) <- c(levels(Sparrows_df$Predator.Type), "None")
Sparrows_df$Predator.Type[is.na(Sparrows_df$Predator.Type)] <- "None"
```

With that taken care of, we again build our models one-by-one.

Again, because of issues with normality of residuals, we default to our three sites across Central and North America, which are of coastal climate:

```
# everything west of -7° is on the Americas
# every that's coastal climate type is retained
# everything north of 11° is central and north America
CentralNorthAm_df <- Sparrows_df[Sparrows_df$Longitude < -7 & Sparrows_df$Climate == "Coastal" & Sparrows_df$Latitude > 11, ]
```

`Weight`

as a result of `Predator.Presence`

```
ggplot(data = CentralNorthAm_df, aes(x = Predator.Presence, y = Weight)) +
geom_boxplot() +
theme_bw()
```

```
H1_PredationPresence <- lm(Weight ~ Predator.Presence, data = CentralNorthAm_df)
par(mfrow = c(2, 2))
plot(H1_PredationPresence)
```

`summary(H1_PredationPresence)`

```
##
## Call:
## lm(formula = Weight ~ Predator.Presence, data = CentralNorthAm_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5051 -1.0701 -0.0396 1.0749 4.2549
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.4141 0.1752 179.261 < 2e-16 ***
## Predator.PresenceYes -0.6589 0.2131 -3.092 0.00222 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.577 on 248 degrees of freedom
## Multiple R-squared: 0.03711, Adjusted R-squared: 0.03323
## F-statistic: 9.557 on 1 and 248 DF, p-value: 0.002219
```

According to our model, sparrows under pressure of predation are lighter than those which aren’t. Does that make sense? Intuitively, yes, but I would argue that there are too many confounding variable here to be sure.

`Weight`

as a result of `Predator.Type`

```
ggplot(data = CentralNorthAm_df, aes(x = Predator.Type, y = Weight)) +
geom_boxplot() +
theme_bw()
```

```
H1_PredationType <- lm(Weight ~ Predator.Type, data = CentralNorthAm_df)
par(mfrow = c(2, 2))
plot(H1_PredationType)
```

`summary(H1_PredationType)`

```
##
## Call:
## lm(formula = Weight ~ Predator.Type, data = CentralNorthAm_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6593 -1.0263 0.0272 0.9207 3.2359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.9093 0.1270 235.439 < 2e-16 ***
## Predator.TypeNon-Avian 2.2335 0.2064 10.819 < 2e-16 ***
## Predator.TypeNone 1.5047 0.1925 7.817 1.56e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.302 on 247 degrees of freedom
## Multiple R-squared: 0.3467, Adjusted R-squared: 0.3414
## F-statistic: 65.54 on 2 and 247 DF, p-value: < 2.2e-16
```

OK. This clearly shows that there are either some confounds present or that our data shows something very counter-intuitive. Sparrows under nor predation are lighter than sparrows under non-avian predation? That makes no sense to me.

`Weight`

as a result of `Predator.Presence`

and `Predator.Type`

```
ggplot(data = CentralNorthAm_df, aes(x = Predator.Presence, y = Weight, fill = Predator.Type)) +
geom_boxplot() +
theme_bw()
```

```
H1_PredationFULL <- lm(Weight ~ Predator.Presence + Predator.Type, data = Sparrows_df)
par(mfrow = c(2, 2))
plot(H1_PredationFULL)
```

`summary(H1_PredationFULL)`

```
##
## Call:
## lm(formula = Weight ~ Predator.Presence + Predator.Type, data = Sparrows_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6635 -2.2987 -0.0844 1.9563 9.6165
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.6145 0.1816 168.60 <2e-16 ***
## Predator.PresenceYes -3.5709 0.2387 -14.96 <2e-16 ***
## Predator.TypeNon-Avian 5.2809 0.2789 18.94 <2e-16 ***
## Predator.TypeNone NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.431 on 1063 degrees of freedom
## Multiple R-squared: 0.2901, Adjusted R-squared: 0.2888
## F-statistic: 217.2 on 2 and 1063 DF, p-value: < 2.2e-16
```

Those residuals don’t look good, but that’s not what I am after with this model. Immediately, we should notice that our model returns `NA`

for the parameter estimate of `Predator.TypeNone`

. Why does that happen? Because `Predator.TypeNone`

coincides with `Predator.PresenceNo`

(the `Intercept`

) of this model and so does not provide any additional information. In fact, `Predator.Type`

offers all the information of `Predator.Presence`

and then some! Consequently, including both in a model does not make sense and we can immediately disqualify this model.

## Null & Full Model

For some comparison further down the line, we need a null model:

```
H1_Null_Sparrows <- lm(Weight ~ 1, data = Sparrows_df)
H1_Null_CNA <- lm(Weight ~ 1, data = CentralNorthAm_df)
```

Our full model contains all of our aforementioned variables/parameters to the best of our knowledge/intuition at this point. We will get to making this model better later. Don’t worry:

```
H1_FULL_Sparrows <- lm(Weight ~ Climate + TAvg + TSD + Home.Range * Flock.Size + Predator.Type, data = Sparrows_df)
par(mfrow = c(2, 2))
plot(H1_FULL_Sparrows)
```

`summary(H1_FULL_Sparrows)`

```
##
## Call:
## lm(formula = Weight ~ Climate + TAvg + TSD + Home.Range * Flock.Size +
## Predator.Type, data = Sparrows_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.2164 -1.0436 0.0939 1.0495 4.7403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.65271 4.09482 10.660 < 2e-16 ***
## ClimateContinental 2.28851 0.31524 7.259 7.53e-13 ***
## ClimateSemi-Coastal -0.79097 0.30663 -2.580 0.01003 *
## TAvg -0.04252 0.01402 -3.034 0.00247 **
## TSD 0.01431 0.04032 0.355 0.72272
## Home.RangeMedium 1.63110 0.85370 1.911 0.05632 .
## Home.RangeSmall 2.36146 0.41639 5.671 1.83e-08 ***
## Flock.Size -0.03445 0.01896 -1.817 0.06955 .
## Predator.TypeNon-Avian 0.24253 0.16673 1.455 0.14605
## Predator.TypeNone 0.75123 0.17268 4.350 1.49e-05 ***
## Home.RangeMedium:Flock.Size -0.10115 0.05614 -1.802 0.07186 .
## Home.RangeSmall:Flock.Size -0.16452 0.01990 -8.267 4.09e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.605 on 1054 degrees of freedom
## Multiple R-squared: 0.846, Adjusted R-squared: 0.8444
## F-statistic: 526.6 on 11 and 1054 DF, p-value: < 2.2e-16
```

Already at this point, it is interesting to point out how some previously non-significant effects drop out while other become significant due to the inclusion of all parameters in one model.

Now we also need a full model for our Central-/North-America data:

```
H1_FULL_CNA <- lm(Weight ~ TAvg + TSD + Home.Range * Flock.Size + Predator.Type, data = CentralNorthAm_df)
par(mfrow = c(2, 2))
plot(H1_FULL_CNA)
```

`summary(H1_FULL_CNA)`

```
##
## Call:
## lm(formula = Weight ~ TAvg + TSD + Home.Range * Flock.Size +
## Predator.Type, data = CentralNorthAm_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.6979 -0.9364 -0.0501 1.0483 3.3590
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.18473 6.20215 2.287 0.0231 *
## TAvg 0.05336 0.02038 2.618 0.0094 **
## TSD 0.39853 0.07842 5.082 7.48e-07 ***
## Home.RangeMedium -3.83588 1.99164 -1.926 0.0553 .
## Home.RangeSmall -1.50681 1.03368 -1.458 0.1462
## Flock.Size -0.07581 0.05850 -1.296 0.1963
## Predator.TypeNon-Avian NA NA NA NA
## Predator.TypeNone NA NA NA NA
## Home.RangeMedium:Flock.Size 0.29821 0.13273 2.247 0.0256 *
## Home.RangeSmall:Flock.Size 0.10097 0.06212 1.625 0.1054
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.285 on 242 degrees of freedom
## Multiple R-squared: 0.3765, Adjusted R-squared: 0.3585
## F-statistic: 20.88 on 7 and 242 DF, p-value: < 2.2e-16
```

## Summary of Linear Regression

To streamline the next few steps of our exercise on Model Selection and Validation, we combine all of our models into a `list`

object for now:

```
H1_ModelSparrows_ls <- list(
H1_Null_Sparrows,
H1_CompetitionFS,
H1_CompetitionFULL,
H1_FULL_Sparrows
)
names(H1_ModelSparrows_ls) <- c(
"Null",
"Comp_Flock.Size", "Comp_Full",
"Full"
)
H1_ModelCNA_ls <- list(
H1_Null_CNA,
H1_ClimateTavg,
H1_ClimateTSD,
H1_ClimateCont,
H1_PredationPresence,
H1_PredationType,
H1_FULL_CNA
)
names(H1_ModelCNA_ls) <- c(
"Null",
"Clim_TAvg", "Clim_TSD", "Clim_Full",
"Pred_Pres", "Pred_Type",
"Full"
)
```

# Mixed Effect Models

Remember the **Assumptions of Mixed Effect Models**:

- Variable values follow homoscedasticity (equal variance across entire data range)

- Residuals follow normal distribution (normality)

- Absence of influential outliers

- Response and Predictor are related in a linear fashion

## Climate Conditions

Here, I create a mixed effect model that is accounting predicting `Weight`

with a combination `TAvg`

and `TSD`

while accounting for random intercepts of `Index`

and `Population.Status`

. As we learned in our exercise on Classifications, these two are probably important in controlling for some morphology effects in `Weight`

.

I create such a model for:

- The
**entire sparrow data data set**

- Only the
**Central-/North-America sparrows**

to make models comparable to our previous basic, linear models.

### Global Model

```
H1_Climate_ME_Sparrows <- lme(Weight ~ TAvg + TSD,
random = list(
Population.Status = ~1,
Index = ~1
),
data = Sparrows_df
)
par(mfrow = c(2, 2))
plot(fitted(H1_Climate_ME_Sparrows), resid(H1_Climate_ME_Sparrows)) # Homogeneity of Variances, values around 0, no pattern --> good!
plot(fitted(H1_Climate_ME_Sparrows), Sparrows_df$Weight) # Linearity, there is a clear pattern here --> bad!
qqnorm(resid(H1_Climate_ME_Sparrows)) # Normality, residuals are normal distributed -> good!
hist(leverage(H1_Climate_ME_Sparrows)[, 1]) # Leverage, not really any outliers --> good!
```

Quite evidently, the assumption of linearity is violated. Notice that we would not like to use this model for predictions, but will carry it forward for the purpose of subsequent Model Selection and Validation.

`summary(H1_Climate_ME_Sparrows)`

```
## Linear mixed-effects model fit by REML
## Data: Sparrows_df
## AIC BIC logLik
## 3557.109 3586.922 -1772.554
##
## Random effects:
## Formula: ~1 | Population.Status
## (Intercept)
## StdDev: 0.001507855
##
## Formula: ~1 | Index %in% Population.Status
## (Intercept) Residual
## StdDev: 2.640311 1.237155
##
## Fixed effects: Weight ~ TAvg + TSD
## Value Std.Error DF t-value p-value
## (Intercept) 37.83099 31.229059 1055 1.2114034 0.2260
## TAvg -0.03070 0.104602 7 -0.2934692 0.7777
## TSD 0.27536 0.265143 7 1.0385452 0.3336
## Correlation:
## (Intr) TAvg
## TAvg -0.999
## TSD -0.826 0.809
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.956660984 -0.746001839 0.004452834 0.722973350 2.617329146
##
## Number of Observations: 1066
## Number of Groups:
## Population.Status Index %in% Population.Status
## 2 11
```

### Central-/North-American Model

Accounting for population status in among the Central-/North-American sites makes no sense as all of them are introduced populations.

```
H1_Climate_ME_CNA <- lme(Weight ~ TAvg + TSD,
random = list(Index = ~1),
data = CentralNorthAm_df
)
par(mfrow = c(2, 2))
plot(fitted(H1_Climate_ME_CNA), resid(H1_Climate_ME_CNA)) # Homogeneity of Variances, values around 0, no pattern --> good!
plot(fitted(H1_Climate_ME_CNA), CentralNorthAm_df$Weight) # Linearity, there is no clear pattern here --> good!
qqnorm(resid(H1_Climate_ME_CNA)) # Normality, residuals are normal distributed -> good!
hist(leverage(H1_Climate_ME_CNA)[, 1]) # Leverage, not really any outliers --> good!
```

With this model, all of our assumption are met nicely!

`summary(H1_Climate_ME_CNA)`

```
## Linear mixed-effects model fit by REML
## Data: CentralNorthAm_df
## AIC BIC logLik
## 863.9566 881.5035 -426.9783
##
## Random effects:
## Formula: ~1 | Index
## (Intercept) Residual
## StdDev: 0.3802608 1.301734
##
## Fixed effects: Weight ~ TAvg + TSD
## Value Std.Error DF t-value p-value
## (Intercept) 16.598661 13.291655 247 1.2488031 0.2129
## TAvg 0.042146 0.042863 0 0.9832757 NaN
## TSD 0.381424 0.174011 0 2.1919464 NaN
## Correlation:
## (Intr) TAvg
## TAvg -0.999
## TSD -0.953 0.945
##
## Standardized Within-Group Residuals:
## Min Q1 Med Q3 Max
## -2.8111222 -0.7883782 0.0208856 0.7072618 2.4858581
##
## Number of Observations: 250
## Number of Groups: 3
```

## Competition

With competition effects as my goal, I can now include climate classification as one of the random effects! Of course, this is with the exception of the Central-/North-American data as all of these sites are of the type “Coastal”.

Again, we do so for a global and a local model:

### Global Model

```
H1_Comp_ME_Sparrows <- lme(Weight ~ Home.Range * Flock.Size,
random = list(
Population.Status = ~1,
Climate = ~1
),
data = Sparrows_df
)
par(mfrow = c(2, 2))
plot(fitted(H1_Comp_ME_Sparrows), resid(H1_Comp_ME_Sparrows)) # Homogeneity of Variances, values around 0, no pattern --> good!
plot(fitted(H1_Comp_ME_Sparrows), Sparrows_df$Weight) # Linearity, there is a clear pattern here --> bad!
qqnorm(resid(H1_Comp_ME_Sparrows)) # Normality, residuals are normal distributed -> good!
hist(leverage(H1_Comp_ME_Sparrows)[, 1]) # Leverage, not really any outliers --> good!
```

### Central-/North-American Model

```
H1_Comp_ME_CNA <- lme(Weight ~ Home.Range * Flock.Size,
random = list(Climate = ~1),
data = CentralNorthAm_df
)
par(mfrow = c(2, 2))
plot(fitted(H1_Comp_ME_CNA), resid(H1_Comp_ME_CNA)) # Homogeneity of Variances, values around 0, no pattern --> good!
plot(fitted(H1_Comp_ME_CNA), CentralNorthAm_df$Weight) # Linearity, there is no clear pattern here --> good!
qqnorm(resid(H1_Comp_ME_CNA)) # Normality, residuals are normal distributed -> good!
hist(leverage(H1_Comp_ME_CNA)[, 1]) # Leverage, not really any outliers --> good!
```

## Predation

With predation effects, I do the same as with competition effects and set a random intercept for climate classification at each site for a global and a local model:

### Global Model

```
H1_Pred_ME_Sparrows <- lme(Weight ~ Predator.Type,
random = list(
Population.Status = ~1,
Climate = ~1
),
data = Sparrows_df
)
par(mfrow = c(2, 2))
plot(fitted(H1_Pred_ME_Sparrows), resid(H1_Pred_ME_Sparrows)) # Homogeneity of Variances, values around 0, no pattern --> good!
plot(fitted(H1_Pred_ME_Sparrows), Sparrows_df$Weight) # Linearity, there is a clear pattern here --> bad!
qqnorm(resid(H1_Pred_ME_Sparrows)) # Normality, residuals are normal distributed -> good!
hist(leverage(H1_Pred_ME_Sparrows)[, 1]) # Leverage, not really any outliers --> good!
```

### Central-/North-American Model

```
H1_Pred_ME_CNA <- lme(Weight ~ Predator.Type,
random = list(Index = ~1),
data = CentralNorthAm_df
)
par(mfrow = c(2, 2))
plot(fitted(H1_Pred_ME_CNA), resid(H1_Pred_ME_CNA)) # Homogeneity of Variances, values around 0, no pattern --> good!
plot(fitted(H1_Pred_ME_CNA), CentralNorthAm_df$Weight) # Linearity, there is no clear pattern here --> good!
qqnorm(resid(H1_Pred_ME_CNA)) # Normality, residuals are normal distributed -> good!
hist(leverage(H1_Pred_ME_CNA)[, 1]) # Leverage, not really any outliers --> good!
```

## Full Model

### Global Model

```
H1_Full_ME_Sparrows <- lme(Weight ~ Predator.Type + Flock.Size * Home.Range + TAvg + TSD,
random = list(Population.Status = ~1),
data = Sparrows_df
)
par(mfrow = c(2, 2))
plot(fitted(H1_Full_ME_Sparrows), resid(H1_Full_ME_Sparrows)) # Homogeneity of Variances, values around 0, no pattern --> good!
plot(fitted(H1_Full_ME_Sparrows), Sparrows_df$Weight) # Linearity, there is a clear pattern here --> bad!
qqnorm(resid(H1_Full_ME_Sparrows)) # Normality, residuals are normal distributed -> good!
hist(leverage(H1_Full_ME_Sparrows)[, 1]) # Leverage, not really any outliers --> good!
```

### Central-/North-American Model

```
H1_Full_ME_CNA <- lme(Weight ~ Flock.Size * Home.Range + TAvg + TSD,
random = list(Index = ~1),
data = CentralNorthAm_df
)
par(mfrow = c(2, 2))
plot(fitted(H1_Full_ME_CNA), resid(H1_Full_ME_CNA)) # Homogeneity of Variances, values around 0, no pattern --> good!
plot(fitted(H1_Full_ME_CNA), CentralNorthAm_df$Weight) # Linearity, there is no clear pattern here --> good!
qqnorm(resid(H1_Full_ME_CNA)) # Normality, residuals are normal distributed -> good!
hist(leverage(H1_Full_ME_CNA)[, 1]) # Leverage, not really any outliers --> good!
```

## Summary of Mixed Effect Models

Mixed effect models are hard and I have certainly not given them full credit for what they are worth here. I highly suggest giving this a read if you see yourself using mixed effect models.

For now, I want to carry along only the full models for our session on Model Selection and Validation:

```
H1_ModelSparrows_ls$Mixed_Full <- H1_Full_ME_Sparrows
H1_ModelCNA_ls$Mixed_Full <- H1_Full_ME_CNA
```

# Generalised Linear Models

For a generalised linear model, we may want to run a logistic regression, which we already did.

For a different example, we now turn to *poisson models* by trying to understand how `Flock.Size`

comes about.

Firstly, we limit our data set to necessary variables and then cut all duplicate rows:

```
Flock_df <- Sparrows_df[, c("Flock.Size", "TAvg", "TSD", "Index", "Climate", "Predator.Type")]
Flock_df <- unique(Flock_df)
```

Now we are ready to build a basic linear model:

```
Flock_lm <- lm(Flock.Size ~ TAvg * TSD * Predator.Type, data = Flock_df)
par(mfrow = c(2, 2))
plot(Flock_lm)
```

Everything looks line but that Scale-Location plot. That one seems to indicate that variance in our `Flock.Size`

increases as the `Flock.Size`

itself increases - a very typical example of a poisson distributed variable.

GLMs to the rescue with the poisson GLM:

```
poisson.model <- glm(Flock.Size ~ TAvg * TSD * Predator.Type, data = Flock_df, family = poisson(link = "log"))
par(mfrow = c(2, 2))
plot(poisson.model)
```

We fixed it! Now just for the parameter estimates:

`summary(poisson.model)`

```
##
## Call:
## glm(formula = Flock.Size ~ TAvg * TSD * Predator.Type, family = poisson(link = "log"),
## data = Flock_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.02420 -0.88953 -0.09629 0.94122 2.12737
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.822674 2.540369 -5.441 5.29e-08 ***
## TAvg 0.061744 0.008564 7.209 5.62e-13 ***
## TSD 7.243621 1.056843 6.854 7.18e-12 ***
## Predator.TypeNon-Avian -48.160439 9.412905 -5.116 3.11e-07 ***
## Predator.TypeNone 10.338971 8.694652 1.189 0.234
## TAvg:TSD -0.026895 0.003936 -6.833 8.34e-12 ***
## TAvg:Predator.TypeNon-Avian 0.171883 0.033327 5.157 2.50e-07 ***
## TAvg:Predator.TypeNone -0.039510 0.029180 -1.354 0.176
## TSD:Predator.TypeNon-Avian 0.067442 0.041609 1.621 0.105
## TSD:Predator.TypeNone -6.742437 1.177538 -5.726 1.03e-08 ***
## TAvg:TSD:Predator.TypeNon-Avian NA NA NA NA
## TAvg:TSD:Predator.TypeNone 0.025061 0.004321 5.800 6.64e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 276.538 on 49 degrees of freedom
## Residual deviance: 53.417 on 39 degrees of freedom
## AIC: 310.71
##
## Number of Fisher Scoring iterations: 4
```

# Final Models

In our upcoming Model Selection and Validation Session, we will look into how to compare and validate models. We now need to select some models we have created here today and want to carry forward to said session.

We have already created `list`

objects for this purpose. Let’s save them alongside the data that was used to create them (in the case of the localised models, at least). Let’s save these as a separate object ready to be loaded into our `R`

environment in the coming session:

`save(H1_ModelSparrows_ls, H1_ModelCNA_ls, CentralNorthAm_df, Sparrows_df, file = file.path("Data", "H1_Models.RData"))`

# SessionInfo

`sessionInfo()`

```
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252 LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
## [5] LC_TIME=English_United Kingdom.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] HLMdiag_0.5.0 nlme_3.1-152 ggplot2_3.3.3
##
## loaded via a namespace (and not attached):
## [1] styler_1.4.1 tidyselect_1.1.0 xfun_0.22 bslib_0.2.4 janitor_2.1.0 rematch2_2.1.2 purrr_0.3.4 reshape2_1.4.4 splines_4.0.5 lattice_0.20-41
## [11] snakecase_0.11.0 colorspace_2.0-0 vctrs_0.3.7 generics_0.1.0 htmltools_0.5.1.1 yaml_2.2.1 mgcv_1.8-34 utf8_1.2.1 rlang_0.4.10 R.oo_1.24.0
## [21] jquerylib_0.1.4 pillar_1.6.0 glue_1.4.2 withr_2.4.2 R.utils_2.10.1 R.cache_0.14.0 lifecycle_1.0.0 plyr_1.8.6 stringr_1.4.0 munsell_0.5.0
## [31] blogdown_1.3 gtable_0.3.0 R.methodsS3_1.8.1 diagonals_6.0.0 evaluate_0.14 labeling_0.4.2 knitr_1.33 fansi_0.4.2 highr_0.9 Rcpp_1.0.6
## [41] scales_1.1.1 backports_1.2.1 jsonlite_1.7.2 farver_2.1.0 digest_0.6.27 stringi_1.5.3 bookdown_0.22 dplyr_1.0.5 grid_4.0.5 tools_4.0.5
## [51] magrittr_2.0.1 sass_0.3.1 tibble_3.1.1 crayon_1.4.1 pkgconfig_2.0.3 ellipsis_0.3.1 MASS_7.3-53.1 Matrix_1.3-2 lubridate_1.7.10 rmarkdown_2.7
## [61] R6_2.5.0 compiler_4.0.5
```