MODEL SELECTION AND STATISTICAL SIGNIFICANCE
Reporting the Best Science
Erik Kusch
erik.kusch@au.dk
Section for Ecoinformatics & Biodiversity
Center for Biodiversity and Dynamics in a Changing World (BIOCHANGE)
Aarhus University
24/02/2021
1 Model Selection
(adjusted) R²
Mallow's Cp
Akaike Information Criterion (AIC)
Bayesian Information Criterion (BIC)
Receiver-Operator Characteristic (ROC)
2 Model Validation
Cross-Validation
Bootstrap
3 Building Models
Subset Selection
Shrinkage Methods
4 Statistical Significance
The p-value Conundrum
Alternatives
5 Summary
What Now?
Model Selection
What? Why? How?
What - Bias-Variance Trade-Off:
Trade-off between smooth and flexible models:
Bias: error that is introduced by modelling a data/real-life problem with a much simpler model
Variance: how much f̂ (the estimated mapping function of predictors and responses) would change (vary) if the training data set were changed
Simple models: high bias, low variance → under-fitting
Complex models: low bias, high variance → over-fitting
Why - to identify the Best Model:
Finding the optimal trade-off between bias and variance allows for the most reliable analyses.
How - Model Selection Criteria:
(adjusted) R²
Mallow's Cp
Akaike Information Criterion (AIC)
Bayesian Information Criterion (BIC)
Receiver-Operator Characteristic (ROC)
...
Model Selection (adjusted) R²
R²
In R: summary(...)$r.squared with ... being a regression object
Definition:
Proportion of variation in Y that can be explained by regression using predictor(s) X. Values bound between 0 and 1.
Does not penalize complex models! Large R² values do not necessarily imply a good model.
Calculation:
R^2 = \frac{TSS - RSS}{TSS} = 1 - \frac{RSS}{TSS} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}    (1)
TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 - total sum of squares
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 - residual sum of squares
n - number of samples
Also called Coefficient of Determination.
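A minimal sketch in R, using the built-in mtcars data purely for illustration, extracting R² and checking it against Equation (1):
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)$r.squared                            # R² as reported by summary()
RSS <- sum(residuals(fit)^2)                      # residual sum of squares
TSS <- sum((mtcars$mpg - mean(mtcars$mpg))^2)     # total sum of squares
1 - RSS / TSS                                     # same value, via Equation (1)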
Model Selection (adjusted) R²
Adjusted R²
In R: summary(...)$adj.r.squared with ... being a regression object
Definition:
Proportion of variation in Y that can be explained by regression using predictor(s) X. Values bound between 0 and 1.
Does penalize complex models! Increases only if an added predictor improves the model fit more than expected by chance.
Calculation:
R^2_{adj} = 1 - \frac{\frac{1}{n-p-1} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2} = R^2 - (1 - R^2)\,\frac{p}{n-p-1}    (2)
TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2 - total sum of squares
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 - residual sum of squares
n - number of samples
p - number of parameters
The larger p is relative to n, the larger the adjustment will be.
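A small illustrative sketch of why the adjustment matters: adding a pure-noise predictor (invented here) tends to raise R² but lower adjusted R². Data and predictor are arbitrary choices.
set.seed(1)
mtcars$noise <- rnorm(nrow(mtcars))                          # predictor with no real signal
fit1 <- lm(mpg ~ wt, data = mtcars)
fit2 <- lm(mpg ~ wt + noise, data = mtcars)
c(summary(fit1)$r.squared, summary(fit2)$r.squared)          # R² never decreases
c(summary(fit1)$adj.r.squared, summary(fit2)$adj.r.squared)  # adjusted R² typically does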
Model Selection Mallow's Cp
Mallow's Cp
In R: Cp() in the CombMSC package
Definition:
Estimate of the test mean squared error of a regression model fit using ordinary least squares.
Does penalize complex models!
Calculation:
C_p = \frac{1}{n} \left( RSS + 2p\hat{\sigma}^2 \right)    (3)
RSS = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 - residual sum of squares
n - number of samples
p - number of parameters
\hat{\sigma}^2 - estimate of the variance of the error ε
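A hand-rolled sketch of Equation (3) in base R; the Cp() function from CombMSC mentioned above wraps this. Here σ̂² is taken from the largest candidate model (a common convention) and the models and data are illustrative.
full   <- lm(mpg ~ wt + hp + disp, data = mtcars)   # largest candidate model
small  <- lm(mpg ~ wt, data = mtcars)               # candidate model to score
n      <- nrow(mtcars)
p      <- length(coef(small)) - 1                   # number of predictors
sigma2 <- summary(full)$sigma^2                     # estimate of Var(ε)
(sum(residuals(small)^2) + 2 * p * sigma2) / n      # C_p, Equation (3)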
Model Selection Akaike Information Criterion (AIC)
Akaike Information Criterion (AIC)
In R: AIC() in base R
Definition:
Estimate of the test mean squared error of a regression model fit using maximum likelihood estimation.
Does penalize complex models!
Calculation:
AIC = 2p - 2\ln(L(\hat{\theta}))    (4)
p - number of parameters
L(\hat{\theta}) - maximum value of the model likelihood function
For the standard linear model (Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon) with Gaussian errors, maximum likelihood and least squares are the same thing, leading to
AIC = \frac{1}{n\hat{\sigma}^2} \left( RSS + 2p\hat{\sigma}^2 \right)    (5)
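A short base-R sketch of AIC-based comparison between two illustrative models (lower AIC is preferred):
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp, data = mtcars)
AIC(m1, m2)    # compare; note AIC() also counts the error variance as a parameter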
Model Selection Bayesian Information Criterion (BIC)
Bayesian Information Criterion (BIC)
In R: BIC() in base R
Definition:
Estimate of the test mean squared error of a regression model fit using maximum likelihood estimation.
Generally penalizes complex models more than other metrics!
Calculation:
BIC = \ln(n)\,p - 2\ln(L(\hat{\theta}))    (6)
n - number of samples
p - number of parameters
L(\hat{\theta}) - maximum value of the model likelihood function
For the standard linear model (Y = \beta_0 + \sum_{j=1}^{p} \beta_j X_j + \varepsilon) with Gaussian errors we get:
BIC = \frac{1}{n} \left( RSS + \ln(n)\,p\,\hat{\sigma}^2 \right)    (7)
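A companion sketch contrasting the two penalties: BIC's ln(n)·p term exceeds AIC's 2p whenever n > e² ≈ 7.4, so BIC leans more strongly towards the simpler model. Models and data are illustrative.
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)
AIC(m1, m2)          # penalty of 2 per extra parameter
BIC(m1, m2)          # penalty of log(n) per extra parameter
log(nrow(mtcars))    # > 2 here, so BIC punishes the larger model harder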
Model Selection Receiver-Operator Characteristic (ROC)
Receiver-Operator Characteristic (ROC)
In R: ROC() in the Epi package
Definition: Multiple metrics estimating classification accuracy.
Highlights the trade-off between Sensitivity (rate of true positives) and Specificity (rate of true negatives).
Calculation:
Specificity = \frac{TN}{TN + FP}    (8)
Sensitivity = \frac{TP}{TP + FN}    (9)
TN - number of true negative assignments
FP - number of false positive assignments
TP - number of true positive assignments
FN - number of false negative assignments
The AUC (area under the ROC curve) indicates how well the model performs overall. Higher scores represent better accuracy.
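A base-R sketch of Equations (8) and (9) for a logistic classifier at a single 0.5 cut-off; the ROC() function from Epi mentioned above traces the full curve and reports the AUC. The data and threshold are illustrative.
fit  <- glm(am ~ wt + hp, data = mtcars, family = binomial)    # am is a 0/1 outcome
pred <- as.numeric(predict(fit, type = "response") > 0.5)
TP <- sum(pred == 1 & mtcars$am == 1);  TN <- sum(pred == 0 & mtcars$am == 0)
FP <- sum(pred == 1 & mtcars$am == 0);  FN <- sum(pred == 0 & mtcars$am == 1)
TN / (TN + FP)    # Specificity, Equation (8)
TP / (TP + FN)    # Sensitivity, Equation (9)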
Model Validation
What? Why? How?
What - Assess Model Inference:
How well do our models predict outcomes Y given inputs X?
Why - to quantify how much we trust our models:
Placing a lot of trust in a non-validated model can have terrible consequences.
Comparing how much to trust different models can help us choose the better model or weight predictions according to accuracy.
How - Model Validation:
Training/Test Data Approach
Leave-One-Out Cross-Validation (LOOCV)
k-Fold Cross-Validation (k-fold CV)
Bootstrap
...
Model Validation Cross-Validation
Training/Test Data
Procedure:
1 Randomly split the data into training and test (also known as hold-out) parts.
2 Use the training part to build each possible model.
3 For each model, use the test part to calculate the test error rate.
4 Choose the model that gave the lowest test error rate.
Drawbacks:
The test error can be highly variable on different sampling splits.
Only part of the observations are used to fit the model (training data). Statistical
methods tend to have higher bias when trained on fewer observations.
Also known as Validation Data Cross-Validation.
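A minimal sketch of the procedure with a random 70/30 split; split proportion, models, and data are illustrative choices.
set.seed(42)
idx   <- sample(nrow(mtcars), size = 0.7 * nrow(mtcars))   # 1. random split
train <- mtcars[idx, ];  test <- mtcars[-idx, ]
m1 <- lm(mpg ~ wt,      data = train)                      # 2. fit candidates on the training part
m2 <- lm(mpg ~ wt + hp, data = train)
test_mse <- function(m) mean((test$mpg - predict(m, newdata = test))^2)
test_mse(m1); test_mse(m2)                                 # 3./4. keep the lowest test error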
Model Validation Cross-Validation
Leave-One-Out Cross-Validation (LOOCV)
Procedure:
1 Split the data into training (n − 1 observations) and test (1 observation) parts.
2 For i in 1, ..., n:
  1 Fit the model on the training part and obtain \hat{y}_i for x_i in the test part.
  2 Compute the corresponding test error, denoted MSE_i.
3 Compute the final MSE for each candidate model: CV_{(n)} = \frac{1}{n} \sum_{i=1}^{n} MSE_i
Advantages over the validation set approach:
Far less bias. Tends not to overestimate the test error rate as much as the validation set approach does.
Performing LOOCV multiple times will always yield the same results - there is no randomness in the training/validation set splits.
Drawbacks:
Computational intensity (every model needs to be fit n times)!
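A plain loop implementing LOOCV for one illustrative candidate model; for glm() fits the boot package's cv.glm() gives the same estimate without the explicit loop.
n   <- nrow(mtcars)
err <- numeric(n)
for (i in 1:n) {
  fit    <- lm(mpg ~ wt, data = mtcars[-i, ])               # fit on the n - 1 training rows
  pred   <- predict(fit, newdata = mtcars[i, , drop = FALSE])
  err[i] <- (mtcars$mpg[i] - pred)^2                        # MSE_i on the held-out row
}
mean(err)                                                   # CV_(n)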
Model Validation Cross-Validation
k-Fold Cross-Validation (k-fold CV)
Procedure:
1 Split the data into K folds. For each candidate model:
  1 Fit the model on K − 1 training folds, compute the error (MSE) on the remaining test fold.
  2 Repeat the above step K times for different test folds, resulting in MSE_1, ..., MSE_k.
  3 Calculate the corresponding k-fold CV value: CV_{(k)} = \frac{1}{k} \sum_{i=1}^{k} MSE_i
2 Choose the model with the lowest CV_{(k)}.
Advantage over LOOCV:
Much less computationally expensive!
LOOCV is k-fold CV with k = n.
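A sketch using cv.glm() from the boot package with K = 10; package availability and the illustrative model are assumptions, and a manual loop over folds (as for LOOCV above) would work just as well.
library(boot)
fit <- glm(mpg ~ wt + hp, data = mtcars)     # cv.glm() expects a glm() fit
set.seed(1)
cv.glm(mtcars, fit, K = 10)$delta[1]         # 10-fold CV estimate of the test MSE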
Model Validation Bootstrap
Bootstrap
Procedure:
1 Treat the observed sample x = (x_1, x_2, ..., x_n) as the population.
2 Obtain a bootstrap sample x* = (x*_1, x*_2, ..., x*_n) by resampling with replacement.
3 Repeat the above step B times to receive B bootstrap samples, build models for each sample and estimate model parameters.
Advantages:
Very flexible in its application to different methods.
Allows assessments of parameter uncertainty.
Bootstrap estimates of a sampling distribution are analogous to a histogram: one constructs a histogram of the available sample to obtain an estimate of the shape of the density function.
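A minimal sketch bootstrapping a regression slope with B = 1000 resamples (the boot package wraps the same idea); data and parameter are illustrative.
set.seed(1)
B    <- 1000
beta <- replicate(B, {
  zstar <- mtcars[sample(nrow(mtcars), replace = TRUE), ]   # bootstrap sample Z*
  coef(lm(mpg ~ wt, data = zstar))["wt"]                    # re-estimate the parameter
})
sd(beta)                         # bootstrap standard error of the slope
quantile(beta, c(0.025, 0.975))  # simple percentile interval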
Building Models Subset Selection
Best Subset Selection
Let M_0 denote the null model, which contains no predictors.
1 For k = 1, 2, ..., p:
  1 Fit all \binom{p}{k} = \frac{p!}{k!\,(p-k)!} models that contain exactly k predictors.
  2 Pick the best among these \binom{p}{k} models, and call it M_k.
2 Select a single best model from among M_0, ..., M_p using cross-validated prediction error, C_p (AIC), BIC, or adjusted R².
Low RSS or a high R² indicates a model with a low training error, whereas a good model is characterized by a low test error rate.
Advantages:
Simple and conceptually appealing approach.
Drawbacks:
Suffers from computational limitations and becomes computationally unfeasible for p > 40.
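A brute-force sketch of the algorithm for a small, illustrative predictor pool, scoring the candidates with BIC; the leaps package performs the same search far more efficiently.
preds   <- c("wt", "hp", "disp", "qsec")                   # p = 4, so 2^p = 16 candidate models
subsets <- unlist(lapply(1:length(preds),
                         function(k) combn(preds, k, simplify = FALSE)),
                  recursive = FALSE)
subsets <- c(list(character(0)), subsets)                  # add the null model M_0
fits <- lapply(subsets, function(s) {
  rhs <- if (length(s) == 0) "1" else paste(s, collapse = " + ")
  lm(as.formula(paste("mpg ~", rhs)), data = mtcars)
})
formula(fits[[which.min(sapply(fits, BIC))]])              # step 2: pick among M_0, ..., M_p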
Building Models Subset Selection
Forward Selection
Let M_0 denote the null model, which contains no predictors.
1 For k = 0, 1, ..., p − 1:
  1 Consider all p − k models that add one predictor to M_k.
  2 Choose the best among these p − k models, and call it M_{k+1}.
2 Select a single best model from among M_0, ..., M_p using cross-validated prediction error, C_p (AIC), BIC, or adjusted R².
Advantages over Best Subset Selection:
Reduced computational expense. Only considers 1 + \sum_{k=0}^{p-1} (p - k) = 1 + p(p+1)/2 models instead of 2^p.
Drawbacks:
Not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors.
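A sketch using base R's step(): start from the null model and let it add one predictor at a time by AIC. The candidate pool and data are illustrative.
null <- lm(mpg ~ 1, data = mtcars)                   # M_0
fwd  <- step(null, scope = ~ wt + hp + disp + qsec,  # pool of candidate predictors
             direction = "forward", trace = 0)
formula(fwd)                                         # the selected model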
Building Models Subset Selection
Backward Selection
Let M_p denote the full model, which contains p predictors.
1 For k = p, p − 1, ..., 1:
  1 Consider all k models that contain all but one of the predictors in M_k, for a total of k − 1 predictors.
  2 Choose the best among these k models, and call it M_{k-1}.
2 Select a single best model from among M_0, ..., M_p using cross-validated prediction error, C_p (AIC), BIC, or adjusted R².
Advantages over Best Subset Selection:
Reduced computational expense. Only considers 1 + \sum_{k=0}^{p-1} (p - k) = 1 + p(p+1)/2 models instead of 2^p.
Drawbacks:
Not guaranteed to find the best possible model out of all 2^p models containing subsets of the p predictors.
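The mirror-image sketch with step(): start from the full model and drop one predictor at a time by AIC (again with illustrative data and predictors).
full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)   # M_p
bwd  <- step(full, direction = "backward", trace = 0)
formula(bwd)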
Building Models Shrinkage Methods
Shrinkage - What Do I Use It For?
Shrinking extreme values towards a central value results in a better estimate of the true mean.
Why?
More stable parameter estimates (fewer extreme outliers considered)
Reduction of sampling and non-sampling errors
Disadvantages:
Erroneous estimates if the population has an atypical mean. Knowing when this is the case is difficult.
Possible introduction of bias.
Shrunk models may fit new data worse than the original models would.
How?
Fit a model with all p predictors.
Shrink the estimated coefficients towards zero relative to the least squares estimates.
Depending on what type of shrinkage is performed, some of the coefficients may be estimated to be exactly zero. Hence, shrinkage methods can also perform variable selection.
Building Models Shrinkage Methods
Ridge Regression
The ridge regression coefficient estimates, \hat{\beta}^R, are the values that minimize
RSS + \lambda \sum_{j=1}^{p} \beta_j^2 = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2    (10)
Equation 10 trades off two different criteria:
Coefficient estimates that fit the data well, by making the RSS small.
The shrinkage penalty (\lambda \sum_j \beta_j^2) is small when \beta_1, ..., \beta_p are close to zero, thus the shrinkage penalty forces the estimates of \beta_j towards zero.
The tuning parameter λ controls the relative impact of these two terms on the regression coefficient estimates. When λ = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates. As λ → ∞, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates approach zero (decreased variance but increased bias).
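A sketch assuming the glmnet package: alpha = 0 selects the ridge penalty of Equation (10), and cross-validation picks λ. Predictors and data are illustrative.
library(glmnet)
x <- model.matrix(mpg ~ wt + hp + disp + qsec, mtcars)[, -1]   # predictor matrix without intercept
y <- mtcars$mpg
set.seed(1)
cv_ridge <- cv.glmnet(x, y, alpha = 0)         # ridge: coefficients shrink towards zero
coef(cv_ridge, s = "lambda.min")               # but none are set exactly to zero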
Building Models Shrinkage Methods
The Lasso
The lasso coefficients, \hat{\beta}^L_\lambda, minimize the quantity
RSS + \lambda \sum_{j=1}^{p} |\beta_j| = \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j|    (11)
The \beta_j^2 term in the ridge regression penalty has been replaced by |\beta_j| in the lasso.
The penalty |\beta_j| has the effect of forcing some of the coefficient estimates to be exactly 0 when the tuning parameter λ is sufficiently large.
The lasso performs variable selection.
Models generated from the lasso (also referred to as sparse models) are generally much easier to interpret than those produced by ridge regression.
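The same glmnet sketch with alpha = 1 for the lasso penalty of Equation (11); exact zeros in the coefficient vector mark deselected variables.
library(glmnet)
x <- model.matrix(mpg ~ wt + hp + disp + qsec, mtcars)[, -1]
y <- mtcars$mpg
set.seed(1)
cv_lasso <- cv.glmnet(x, y, alpha = 1)         # lasso: some coefficients become exactly 0
coef(cv_lasso, s = "lambda.min")               # zeros = dropped predictors (variable selection)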
Building Models Shrinkage Methods
Ridge vs. Lasso
Figure 1: Error and constraint functions of the lasso and ridge regression. Both panels present a situation where p = 2: contours of the error and constraint functions for the lasso (left) and ridge regression (right). The solid blue areas are the constraint regions, |\beta_1| + |\beta_2| \le s and \beta_1^2 + \beta_2^2 \le s, while the red ellipses are the contours of the RSS.
Statistical Significance The p-value Conundrum
The p-value Conundrum
"The p-value is the probability of randomly obtaining an effect at least as
extreme as the one in your sample data, given the null hypothesis."
Misconceptions
The p-value is not designed to tell
us whether something is strictly
true or false
It is not the probability of the null
hypothesis being true
The size of p does not yield any
information about the strength of
an observed effect
Mathematical Quirks
It varies strongly from
sample-to-sample (depending on
statistical power of the set-up)
If the sample size is big enough,
the p-value will always be below
the .05 cut-off, no matter the
magnitude of the effect
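A small simulation of both quirks, with an arbitrary, tiny true effect: at small n the p-value jumps around from sample to sample, while at very large n it is almost always below .05 for the same effect.
set.seed(1)
p_at_n <- function(n) t.test(rnorm(n, mean = 0.05), rnorm(n, mean = 0))$p.value
replicate(5, p_at_n(50))       # small n: highly variable p-values
replicate(5, p_at_n(50000))    # huge n: below .05 despite a negligible effect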
Statistical Significance Alternatives
Effect sizes
"A measure of the magnitude of a statistical effect within the data (i.e.
values calculated from test statistics)."
Nakagawa & Cuthill (2007). Effect size, confidence interval and statistical significance: A practical guide for biologists. Biological Reviews.
Intuitive to interpret and often what we are interested in
Three types for most situations:
r statistics (correlations)
d statistics (comparisons of values)
OR (odds ratio) statistics (risk measurements)
These are point estimates
Need to be reported alongside some information of credibility
These are usually standardized thus enabling meta-studies
In R: https://cran.r-project.org/web/packages/compute.es/compute.es.pdf and
https://cran.r-project.org/web/packages/effsize/effsize.pdf
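A base-R sketch of one d statistic (Cohen's d: mean difference over pooled SD) on simulated groups; the packages linked above wrap this and the r and OR families.
set.seed(1)
treatment <- rnorm(40, mean = 1.0);  control <- rnorm(40, mean = 0.6)
pooled_sd <- sqrt(((length(treatment) - 1) * var(treatment) +
                   (length(control)   - 1) * var(control)) /
                  (length(treatment) + length(control) - 2))
(mean(treatment) - mean(control)) / pooled_sd    # d statistic (a point estimate)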
Statistical Significance Alternatives
Confidence Intervals
"Confidence intervals (CIs) answer the questions: ’How strong is the
effect’ and ’How accurate is that estimate of the population effect’."
Halsey (2019). The reign of the p-value is over: what alternative analyses could we employ to fill the power vacuum? Biology Letters.
Intuitive to interpret
Answers the questions we are most interested in
Does not require additional information of statistical certainty
Combines point estimates and range estimates
Removes some of the pressure of the "file drawer problem"
Shares the same mathematical framework as the p-value calculation
Especially useful in data visualization
In R, many functions come with in-built ways of establishing CIs.
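A one-line sketch for a regression fit: confint() returns an interval estimate for each coefficient, i.e. effect and accuracy in one object (illustrative model).
fit <- lm(mpg ~ wt + hp, data = mtcars)
confint(fit, level = 0.95)    # 95% CIs for the intercept and both slopes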
Statistical Significance Alternatives
Akaike Information Criterion (AIC)
The Akaike Information Criterion (AIC) is an indicator of model fit.
Burnham et al. (2011). AIC model selection and multimodel inference in behavioral ecology: Some background, observations, and comparisons. Behavioral
Ecology and Sociobiology.
Used for model selection and comparison
Lower AICs indicate better model fit
One can establish contrasting models adhering to different hypotheses and identify which model suits the data best
A proper hypothesis selection tool
Model selection often comes with some degree of uncertainty
Can be misused in step-wise model building procedures
In R, most model outputs can be assessed using the AIC() function.
Statistical Significance Alternatives
Bayes Factor
" The minimum Bayes factor is simply the exponential of the difference
between the log-likelihoods of two competing models."
Goodman (2001). Of P-Values and Bayes: A Modest Proposal. Epidemiology.
Intuitive to interpret (Bayes Factor of 1/10 means that our study
decreased the relative odds of the null hypothesis being true tenfold)
Uses prior information to establish expected likelihoods thus enabling a
progression in science
In R: https://cran.r-project.org/web/packages/BayesFactor/BayesFactor.pdf or direct Bayesian
Statistics using JAGS or STAN (for example)
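A deliberately simplistic sketch of the quoted definition for two nested regression models, not a full Bayesian analysis; the models are illustrative.
m_null <- lm(mpg ~ 1,  data = mtcars)    # null-hypothesis model
m_alt  <- lm(mpg ~ wt, data = mtcars)    # alternative model
exp(as.numeric(logLik(m_null) - logLik(m_alt)))   # minimum Bayes factor; values << 1 favour the alternative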
Summary What Now?
Summary
1 “Which model explains my data better?” → Model Comparison
Measures of fit which penalise complex models (e.g. adjusted R², Mallow's Cp)
Information criteria (e.g. AIC, BIC) and classification accuracy (e.g. ROC/AUC)
Practice model comparison, not model selection
2 “How good is my model at predicting things?” → Model Validation
Cross-Validation
Bootstrapping
3 “Which parameters should my model include?” → Model Building
Model comparison/lasso for subset selection
Shrinkage for robust parameter estimates
4 “What do I report?” → Statistical Significance
Don't use p-values!
Report intervals and effect sizes.
Summary What Now?
Where do we go from here?
"Treat statistics as a science, and not a
recipe"
Andrew Vickers
"The numbers are where the scientific
discussion should start, not end!"
Regina Nuzzo