STATISTICAL TERMINOLOGY
The Basics, Misconceptions, and Pedantries
Erik Kusch
erik.kusch@au.dk
Section for Ecoinformatics & Biodiversity
Center for Biodiversity and Dynamics in a Changing World (BIOCHANGE)
Aarhus University
04/03/2020
Aarhus University Biostatistics - Why? What? How? 1 / 38
1 Biostatistical Terms
Population vs. Sample
Test- vs. Training-Data
Randomness
Supervised vs. Unsupervised Approaches
2 Variables & Scales
Basics of Variables
Variables And Scales
3 Distributions
The Basics of Distributions
Normality
What Distributions To Consider
Important Measures Of Distributions
Biostatistical Terms
Population vs. Sample
Population: describes the sum total of all existing values of a variable given a certain research question. This includes non-measured data.
Sample: describes the sum total of all available values of a variable for any given analysis. This can only include measured data.
An example:
In an experimental set-up, you rear an ant colony of exactly 10,000 individuals.
You are interested in the average mandible strength of ants within the colony.
The problem: You cannot possibly take measurements of all 10,000 individuals.
The solution: Taking measurements on a Sample (e.g. 1,000 individuals) from
within the Population (10,000 individuals).
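The ant colony example can be sketched in R as follows. This is only an illustration: the mandible strength values are simulated, and the mean of 5 and standard deviation of 1 are invented numbers, as are the object names population and ant_sample.

```r
set.seed(42)  # makes the simulated values and the sample reproducible

# Hypothetical population: mandible strength of all 10,000 ants
# (the mean of 5 and sd of 1 are invented for illustration)
population <- rnorm(n = 10000, mean = 5, sd = 1)

# Sample: the 1,000 individuals we actually measure
ant_sample <- sample(population, size = 1000)

mean(population)  # population mean (unknown in practice)
mean(ant_sample)  # sample mean, our estimate of the population mean
```

In a real study only the second mean is available; the simulation merely shows that it tends to lie close to the first.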
Test- vs. Training-Data
This differentiation is only applicable when concerned with modelling.
Training Data: describes the subset of the total data which is used to establish/train the model.
Test Data: describes the subset of the total data which is used to test the performance of the model.
The problem: You have identified a way to model how mandible strength and
ant size are interconnected but don’t know how to assess the quality of your
model (a model usually fits the data it was built on better than data it has
not seen, so its performance on that data is overly optimistic).
The solution: Split the available data into two non-overlapping subsets of data
(Training and Test Data) and use these separately to build your model and
assess its performance.
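A minimal sketch of such a split in R, assuming simulated data: the data frame ants, the 70/30 split ratio, the linear model, and the helper rmse() are all choices made for this illustration and not part of the original set-up.

```r
set.seed(42)

# Hypothetical data: ant size and mandible strength for 1,000 measured ants
# (the linear relationship and noise level are invented for illustration)
ants <- data.frame(size = runif(1000, min = 2, max = 10))
ants$strength <- 1.5 + 0.8 * ants$size + rnorm(1000, sd = 1)

# Non-overlapping split: 70% of rows for training, the rest for testing
train_rows <- sample(nrow(ants), size = round(0.7 * nrow(ants)))
train <- ants[train_rows, ]
test  <- ants[-train_rows, ]

# Build the model on the training data only
model <- lm(strength ~ size, data = train)

# Root mean squared error on data the model has and has not seen
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
rmse_train <- rmse(train$strength, predict(model, newdata = train))
rmse_test  <- rmse(test$strength, predict(model, newdata = test))
c(train = rmse_train, test = rmse_test)
```

The test error is the more honest measure of model quality because the model never saw those rows while it was being fitted.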
Randomness
Randomisation is one of the most important practices in biological
studies.
A sampling procedure is random when any member of the population has an
equal chance of being selected into the sample.
Training and Test Data Sets are established from the available data with the same
sense of randomness, although there may be exceptions depending on the
modelling procedure at hand.
Data collection: Number all units contained within the set-up and sample those units corresponding to random numbers.
In R: Use the sample() function to create (pseudo-)random subsets. Remember to use set.seed() to make this step reproducible!
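The two steps can be sketched as follows. The object names unit_ids, first_draw and second_draw are made up for this example, and the seed value 42 is arbitrary.

```r
# Number all units in the set-up (here: 10,000 ants)
unit_ids <- 1:10000

set.seed(42)                                  # fix the random number generator state
first_draw <- sample(unit_ids, size = 1000)   # each unit is equally likely to be drawn

set.seed(42)                                  # same seed ...
second_draw <- sample(unit_ids, size = 1000)  # ... same sample

identical(first_draw, second_draw)  # TRUE
```

Without the second call to set.seed(), the two draws would almost certainly differ.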
Stratified Sampling
When do we break true randomness?
When a population can be divided into distinct categories (i.e. strata). These
can be regarded as individual sub-populations.
Stratified sampling ensures that all sub-populations are proportionally
represented in the final population-sample given their relative size.
d
##   s Freq
## 1 A   50
## 2 B   35
## 3 C   15
set.seed(42) # stratified: draws weighted by the relative size of each stratum
table(sample(d$s, size = 100, replace = TRUE, prob = d$Freq))
##
##  A  B  C
## 45 38 17
set.seed(42) # non-stratified: every stratum label is equally likely
table(sample(d$s, size = 100, replace = TRUE))
##
##  A  B  C
## 40 39 21
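Weighting the draws by stratum size reproduces the proportions only on average. A sketch of an alternative that fixes the number of units taken from each stratum is given below; the population data frame, its 1,000 units, and the names n_total and stratified are assumptions made for this illustration.

```r
set.seed(42)

# Hypothetical population of 1,000 units in three strata (50% A, 35% B, 15% C)
population <- data.frame(
  id = 1:1000,
  s  = rep(c("A", "B", "C"), times = c(500, 350, 150))
)

n_total <- 100  # desired overall sample size

# Sample within each stratum, with a size proportional to the stratum's share
strata <- split(population, population$s)
stratified <- do.call(rbind, lapply(strata, function(stratum) {
  n_stratum <- round(n_total * nrow(stratum) / nrow(population))
  stratum[sample(nrow(stratum), size = n_stratum), ]
}))

table(stratified$s)
```

Here the sample contains exactly 50, 35 and 15 units from strata A, B and C, while the choice of units within each stratum remains random.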
Unsupervised Approaches
Unsupervised methods are often used to select the most informative X input
variables for supervised approaches.
Pre-requisites:
Only input variables are observed.
No solution/feedback (output) is given.
Aims:
Divide the observations into relatively distinct groups.
Model the underlying structure or distribution in the data.
"Pre-processing" before a supervised learning analysis and exploratory analyses
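As one possible illustration of an unsupervised approach, the sketch below applies k-means clustering to R's built-in iris data. Choosing k-means, the iris data, and three groups are all assumptions made for this example; the slides do not prescribe a particular method.

```r
set.seed(42)  # k-means starts from random centres

# Only input variables are used; the Species column (a known output) is dropped
inputs <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]

# Divide the observations into 3 relatively distinct groups
# (choosing 3 groups is an assumption made for this example)
clusters <- kmeans(inputs, centers = 3, nstart = 10)

table(clusters$cluster)  # number of observations per group
```

The algorithm never sees the species labels, so the groups it finds reflect structure in the input variables alone.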
Supervised Approaches
Supervised methods are often informed by unsupervised approaches and used
to gain validated information about the data.
Pre-requisites:
Both predictors X and responses Y are observed (there is one y_i for each x_i).
Data is split into Training and Test Data Sets.
Aims:
Learn a mapping function f from X to Y.
Validate established function/model.
Further prediction and inference.
Mostly inferential analyses
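A minimal sketch of learning a mapping function f and using it for inference and prediction, assuming simulated data and a linear model. The "true" relationship, the noise level, and the names observed, new_x and predicted are invented here, and the validation step on Test Data is left out for brevity.

```r
set.seed(42)

# Hypothetical observations: one response y_i for each predictor value x_i
x <- runif(200, min = 0, max = 10)
y <- 2 + 0.5 * x + rnorm(200, sd = 1)  # invented "true" relationship plus noise
observed <- data.frame(x = x, y = y)

# Learn a mapping function f from X to Y
f <- lm(y ~ x, data = observed)

# Inference: estimated coefficients with 95% confidence intervals
confint(f)

# Prediction: responses for predictor values that were not observed
new_x <- data.frame(x = c(2.5, 7.5))
predicted <- predict(f, newdata = new_x)
predicted
```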
Variables & Scales
Types of Variables
Variables can be classed into a multitude of types. The most common
classification system distinguishes:
Categorical Variables (also known as Qualitative Variables)
Scales can be either:
Nominal
Ordinal
Continuous Variables (also known as Quantitative Variables)
Scales can be either:
Discrete
Continuous
Categorical Variables
Categorical variables are those variables which establish and fall into
distinct groups and classes.
Categorical variables:
can take on a finite number of values
assign each unit of the population to one of a finite number of groups
can sometimes be ordered
In R, categorical variables usually come up as object type factor or
character.
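A short sketch of how this looks in practice. The example values and the object names soil, stage and sex are made up for illustration.

```r
# Nominal: groups without an inherent order
soil <- factor(c("Sandy", "Mud", "Permafrost", "Mud"))
levels(soil)         # the finite set of groups (sorted alphabetically by default)

# Ordinal: groups with an order
stage <- factor(c("Juvenile", "Mature", "Juvenile"),
                levels = c("Juvenile", "Mature"), ordered = TRUE)
stage[1] < stage[2]  # comparisons are meaningful for ordered factors

# Character vectors can be converted to factors when needed
sex <- c("Male", "Female", "Female")
table(factor(sex))   # counts per group
```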
Categorical Variables (Examples)
Examples of categorical variables:
Biome Classifications (e.g. "Boreal Forest", "Tundra", etc.)
Sex (e.g. "Male", "Female")
Hierarchy Position (e.g. "α-Individual", "β-Individual", etc.)
Soil Type (e.g. "Sandy", "Mud", "Permafrost", etc.)
Leaf Type (e.g. "Compound", "Single Blade", etc.)
Sexual Reproductive Stage (e.g. "Juvenile", "Mature", etc.)
Species Membership
Family Group Membership
...
Continuous Variables
Continuous variables are those variables which establish a range of
possible data values.
Continuous variables:
can take on an infinite number of values
can take on a new value for each unit in the set-up
can always be ordered
In R, continuous variables usually come up as object type numeric.
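For illustration, with made-up values and object names:

```r
temperature <- c(271.3, 284.9, 290.1)  # continuous scale: any value in a range
group_size  <- c(12L, 7L, 30L)         # discrete scale: whole-number counts

class(temperature)      # "numeric"
class(group_size)       # "integer"
is.numeric(group_size)  # TRUE: integers also count as numeric in R

sort(temperature)       # values can always be ordered
```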
Continuous Variables (Examples)
Examples of continuous variables:
Temperature
Precipitation
Weight
pH
Altitude
Group Size
Vegetation Indices
Time
...
Converting Variable Types
Continuous variables can be converted into categorical variables via a method
called binning:
Given a variable range, one can establish as many “bins” as one wants.
For example:
Given a temperature range of 271K to 291K, there may be 4 bins of equal size:
Bin A: 271K ≤ X ≤ 276K
Bin B: 276K < X ≤ 281K
Bin C: 281K < X ≤ 286K
Bin D: 286K < X ≤ 291K
Whilst a continuous variable can be treated as either continuous or (after binning) categorical,
a categorical variable can only ever be categorical!
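In R, binning can be sketched with the cut() function. The temperature values below are invented, and the bin labels A to D follow the example above.

```r
# Hypothetical temperature measurements in Kelvin
temperature <- c(271, 274.2, 276, 276.1, 283.5, 286, 290.9, 291)

# Bin edges for four bins of equal size between 271K and 291K
breaks <- seq(from = 271, to = 291, by = 5)

# right = TRUE (the default) closes each bin on the right: (276, 281];
# include.lowest = TRUE also closes the first bin on the left: [271, 276]
temperature_bin <- cut(temperature, breaks = breaks,
                       labels = c("A", "B", "C", "D"),
                       include.lowest = TRUE)

data.frame(temperature, temperature_bin)
class(temperature_bin)  # "factor": the binned variable is categorical
```

Values that fall exactly on a bin edge, such as 276, go to the lower bin under these settings.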
Variables On Scales
Another way of classifying variables is by the scales they are represented on.
Different scales of variables require different statistical procedures for
analysis!
Variable scales include:
Nominal
Binary
Ordinal
Interval
Relation/Ratio
Some statistics books teach integer scales alongside the above-mentioned scales.
Others dispute this and consider integer scales to be ratio scales.
Nominal And Binary
Nominal scales of variables correspond to categorical variables which cannot
be put into a meaningful order.
Variables on nominal scales put units into distinct categories
These variables may be coded as numbers, but the numbers offer no
mathematical interpretation
Examples:
Petal colour (red, green, blue, etc.)
Individual IDs
Binary scales are a special case of nominal scales taking only two possible
values, usually coded as 0 and 1.
Ordinal
Ordinal scales of variables correspond to categorical variables which can be
put into a meaningful order.
Variables on ordinal scales put units into distinct categories
These variables may be coded as numbers and offer some mathematical
interpretation (ranking, but no meaningful differences or ratios)
Examples:
Size (small, medium, large, etc.)
Binned continuous variables
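In R, the difference between nominal and ordinal scales can be sketched with unordered and ordered factors. The example values and the object names petal_colour and size are assumptions made for illustration.

```r
# Nominal: categories without a meaningful order (hypothetical petal colours)
petal_colour <- factor(c("red", "blue", "red", "green"))

# Ordinal: categories with a meaningful order (hypothetical size classes)
size <- factor(
  c("small", "large", "medium", "small"),
  levels = c("small", "medium", "large"),
  ordered = TRUE
)

is.ordered(petal_colour)  # no order is defined for the nominal factor
is.ordered(size)          # an order is defined for the ordinal factor

# Order comparisons and min()/max() are only defined for ordered factors
size < "large"
max(size)
```

Arithmetic such as mean(size) is still not meaningful here, which matches the limited mathematical interpretation of ordinal scales.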
Interval/Discrete
Interval scales of variables correspond to a mix of continuous variables.
Variables on interval scales are measured on equal intervals from a defined
zero point/point of origin
The point of origin does not imply an absence of the measured characteristic
Examples:
Temperature [°C]
pH
Relation/Ratio
Relation/Ratio scales of variables correspond to continuous variables.
Variables on relation/ratio scales are measured on equal intervals from a
defined zero point/point of origin
The point of origin does imply an absence of the measured characteristic
Examples:
Temperature [K]
Weight
Integer scales are a special case of ratio scales allowing only for whole
numbers (e.g. counts).
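The practical difference between interval and ratio scales can be sketched with the two temperature examples used on these slides. The two temperature values and the object names temp_C and temp_K are chosen for illustration only.

```r
# Two hypothetical temperatures in degrees Celsius (interval scale)
temp_C <- c(10, 20)

# The same temperatures in Kelvin (ratio scale): K = degrees C + 273.15
temp_K <- temp_C + 273.15

# Differences are meaningful on both scales and agree with each other
diff(temp_C)
diff(temp_K)

# Ratios are only meaningful on the ratio scale:
# 20 degrees C is not "twice as warm" as 10 degrees C
temp_C[2] / temp_C[1]  # 2, an artefact of the arbitrary zero point
temp_K[2] / temp_K[1]  # roughly 1.035
```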
Confusion Of Units
Distributions The Basics of Distributions
What Are Distributions?
A distribution of a statistical data set (sample/population) shows all the
possible values/intervals of the data in question and their frequency.
In short: distributions are the data patterns we are considering/looking for.
Frequency Distributions
Frequency Distributions:
Theory
Simple representations of data value frequencies
Can be established for every variable
Practice in R
Visualisation via the hist() function
[Figure: histogram titled "Frequency Distribution"; x-axis: rnorm(100000, 20, 2); y-axis: Frequency]
# Draw 100,000 values from a normal distribution (mean 20, sd 2) and plot their frequencies
hist(rnorm(100000, 20, 2),
     main = "Frequency Distribution")
Probability Density Distributions I
Probability Density Distributions:
Theory
Representation of data value probabilities
Can be established for continuous variables
Practice in R
Visualisation via the density() function
[Figure: density curve titled "Probability Density Distribution"; x-axis: N = 100000, Bandwidth = 0.1794; y-axis: Density]
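# Estimate and plot the probability density of 100,000 normally distributed values (mean 20, sd 2)
plot(density(rnorm(100000, 20, 2)),
     main = "Probability Density Distribution")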
Probability Density Distributions II
Probability Density Distributions are of central importance in statistics!
A few key points about these distributions:
The area under the curve (AUC) integrates to 1
The probability of any single exact value is 0
The AUC between two values on the X-axis equals the probability of
randomly sampling a value between these two points
Find a masterful explanation of the single-value probability here.
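The three key points can be checked numerically in R. The sketch below uses the standard normal density (dnorm() with its default mean 0 and sd 1) as one example of a probability density; the interval from -1 to 1 and the object names are arbitrary choices for illustration.

```r
# 1) The total area under the density curve is 1
total_area <- integrate(dnorm, lower = -Inf, upper = Inf)$value

# 2) A single value corresponds to an interval of width zero, so its probability is 0
point_prob <- pnorm(0.5) - pnorm(0.5)

# 3) The area between two values equals the probability of sampling between them,
#    here computed once by numerical integration and once via the cumulative
#    distribution function pnorm()
area_between <- integrate(dnorm, lower = -1, upper = 1)$value
prob_between <- pnorm(1) - pnorm(-1)

c(total_area = total_area, point_prob = point_prob,
  area_between = area_between, prob_between = prob_between)
```

Both area_between and prob_between come out at roughly 0.68, the familiar share of values within one standard deviation of the mean.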
Distributions Normality
Univariate Standard Normal/Gaussian Distribution
One of the most important distributions in natural sciences.
Used to represent real-valued random variables whose distributions are not
known
The central limit theorem applies (draw a sufficient number of sufficiently
large samples and their means end up approximately normally distributed,
even if the underlying data are not)
These distributions are also commonly known as "bell curves"
(Attention: other distributions take this shape too)
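The central limit theorem can be illustrated with a small simulation. This is a sketch under assumptions that are not in the original slides: the underlying distribution (exponential with rate 1), the seed, the number of samples and the sample size are arbitrary choices.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Underlying distribution: exponential with rate 1 (strongly right-skewed,
# mean = 1, sd = 1); chosen purely for illustration
n_samples   <- 5000  # number of samples drawn
sample_size <- 50    # observations per sample

# Mean of each sample
sample_means <- replicate(n_samples, mean(rexp(sample_size, rate = 1)))

# The sample means cluster around the population mean (1) with a spread
# close to sd / sqrt(sample_size) = 1 / sqrt(50)
mean(sample_means)
sd(sample_means)

# Their histogram is approximately bell-shaped although the raw data are not
hist(sample_means, main = "Means of exponential samples")
```

Note that it is the distribution of the sample means that approaches normality, not the distribution of the raw values.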
Testing For Normality
Testing for normality of the data is crucial for certain statistical
procedures.
The Shapiro-Wilk Test In Theory
Null hypothesis: The data is normally distributed
If p-value < chosen significance level, the null hypothesis is rejected:
the data is not normally distributed
Very sensitive to sample size
The QQ Plot In Theory
Method for comparing two probability distributions by plotting their
quantiles against each other
If the two distributions being compared are similar, the points will lie
approximately on the line y = x.
Compare the data distribution to the normal distribution
The Shapiro-Wilk Test In R
Using the shapiro.test() function:
shapiro.test(rnorm(5000, 20, 2))
##
## Shapiro-Wilk normality test
##
## data: rnorm(5000, 20, 2)
## W = 1, p-value = 0.7
p-value > 0.05: no evidence against a normally distributed set of values
shapiro.test(seq(1, 500, 5))
##
## Shapiro-Wilk normality test
##
## data: seq(1, 500, 5)
## W = 0.95, p-value = 0.002
Clearly not a normally distributed set of values
For data sets bigger than 5000 data points (the upper limit of
shapiro.test()), use the Kolmogorov-Smirnov test (ks.test()) in R.
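A minimal sketch of the Kolmogorov-Smirnov alternative for larger data sets is given below. The simulated data, the seed and the object names are assumptions for illustration. The reference distribution is specified in advance here (mean 20, sd 2), because the standard p-value of ks.test() is not strictly valid when mean and sd are estimated from the same data.

```r
set.seed(42)  # arbitrary seed for reproducibility

# More values than shapiro.test() accepts
x <- rnorm(10000, mean = 20, sd = 2)

# One-sample Kolmogorov-Smirnov test against a normal distribution whose
# mean and sd are specified in advance (the values used to simulate x)
ks_normal <- ks.test(x, "pnorm", mean = 20, sd = 2)
ks_normal$p.value

# For comparison: evenly spaced values are far from this normal distribution
ks_even <- ks.test(seq(1, 500, length.out = 10000), "pnorm", mean = 20, sd = 2)
ks_even$p.value
```

As with the Shapiro-Wilk test, a p-value below the chosen significance level speaks against normality, while a larger p-value only means that no deviation was detected.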
The Q-Q Plot
Using the qqnorm() function:
qqnorm(rnorm(5000, 20, 2))
[Figure: "Normal Q-Q Plot"; x-axis: Theoretical Quantiles; y-axis: Sample Quantiles]
Clearly a normally distributed set of values
qqnorm(seq(1, 500, 5))
[Figure: "Normal Q-Q Plot"; x-axis: Theoretical Quantiles; y-axis: Sample Quantiles]
Clearly not a normally distributed set of values
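Judging a Q-Q plot by eye is easier with a reference line. The sketch below adds one with qqline() and, as a rough numeric summary, computes the correlation between theoretical and sample quantiles. The seed, the object names and the use of this correlation are illustrative additions, not part of the original slides.

```r
set.seed(42)  # arbitrary seed for reproducibility
normal_values <- rnorm(5000, 20, 2)
even_values   <- seq(1, 500, 5)

# Q-Q plots with a reference line through the first and third quartiles
qqnorm(normal_values)
qqline(normal_values, col = "red")

qqnorm(even_values)
qqline(even_values, col = "red")

# Numeric summary: correlation between theoretical and sample quantiles
# (plot.it = FALSE returns the plotted coordinates without drawing)
qq_normal <- qqnorm(normal_values, plot.it = FALSE)
qq_even   <- qqnorm(even_values, plot.it = FALSE)
cor_normal <- cor(qq_normal$x, qq_normal$y)
cor_even   <- cor(qq_even$x, qq_even$y)
c(normal = cor_normal, even = cor_even)
```

Points that follow the reference line closely, and a correlation very near 1, are what one would expect for normally distributed values; systematic curvature away from the line indicates a deviation.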
Distributions What Distributions To Consider
Binomial Distribution
One of the more important distributions. It is applicable to:
Variables which can only take two possible values (e.g. “states”)
All records of the variable have the same probability p of being in one of
the two states
It is described by three quantities:
p - the "success" probability
n - sample size (how often we sample)
N - the "binomial total" (how many individuals we sample each time)
[Figure: histogram titled "Binomial Distribution" (n = 10000, N = 50, p = 0.6); y-axis: Frequency]
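A binomial sample with the quantities named above can be simulated with rbinom(). This is a sketch assuming the values n = 10000, N = 50 and p = 0.6 from the figure; the seed and the object name successes are illustrative, and the code is not claimed to be the one used to produce the original figure.

```r
set.seed(42)  # arbitrary seed for reproducibility

n <- 10000  # sample size: how often we sample
N <- 50     # binomial total: individuals per sample
p <- 0.6    # "success" probability

# Each value is the number of "successes" among N individuals
successes <- rbinom(n, size = N, prob = p)

mean(successes)  # close to N * p = 30
var(successes)   # close to N * p * (1 - p) = 12

hist(successes, main = "Binomial Distribution",
     xlab = "n = 10000, N = 50, p = 0.6")
```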
Poisson Distribution
Another one of the more important distributions. It is applicable when:
Focal objects are placed randomly in one or more dimensions
A random “counting window” (usually one considering time) is placed above
the sampling scheme
It is described by two quantities:
λ - the mean (= expectation, average count, intensity) as well as the
variance (i.e., variance = mean)
n - sample size
[Figure: histogram titled "Poisson Distribution" (n = 10000, Lambda = 5); y-axis: Frequency]
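The defining property of the Poisson distribution, variance = mean = λ, can be checked with a short simulation using rpois(). The values n = 10000 and λ = 5 are taken from the figure; the seed and the object name counts are illustrative assumptions.

```r
set.seed(42)  # arbitrary seed for reproducibility

n      <- 10000  # sample size
lambda <- 5      # mean and variance of the distribution

# Simulated counts per "counting window"
counts <- rpois(n, lambda = lambda)

mean(counts)  # close to lambda
var(counts)   # also close to lambda

hist(counts, main = "Poisson Distribution",
     xlab = "n = 10000, Lambda = 5")
```

If the variance of real count data is much larger than the mean, this is a hint that a plain Poisson distribution may not describe the data well.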
Distributions Important Measures Of Distributions
How to Measure Distributions
Not all distributions are created equal.
Distributions can be described via classic parameters of descriptive
statistics:
Arithmetic Mean
Mode
Median
Minimum, Maximum, Range
...
Variance
Standard Deviation
Quantile Range
Skewness
Kurtosis
...
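Most of these measures are available directly in base R. The sketch below uses a small made-up data vector; the helper stat_mode() is defined here for illustration because base R has no function for the statistical mode (mode() returns the storage type of an object instead).

```r
# Hypothetical measurements (illustrative values only)
x <- c(2, 3, 3, 4, 5, 5, 5, 7, 9, 12)

mean(x)         # arithmetic mean
median(x)       # median
range(x)        # minimum and maximum
diff(range(x))  # range as a single number
var(x)          # sample variance
sd(x)           # sample standard deviation
IQR(x)          # interquartile range
quantile(x, probs = c(0.25, 0.75))  # first and third quartile

# Helper for the statistical mode: the most frequent value(s)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}
stat_mode(x)
```

Skewness and kurtosis are not part of base R and need either a contributed package or a few lines of code.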
Skewness I
Definition:
Describes the symmetry and relative tail length of distributions.
Positive skew: Right-hand tail is longer than the left-hand tail
Skew = 0: Symmetric distribution
Negative skew: Left-hand tail is longer than the right-hand tail
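Skewness can be computed from its moment-based definition in a few lines. The function skewness() below is a helper written for this sketch (third central moment divided by the cubed standard deviation), not a base R function, and the simulated data sets are arbitrary examples.

```r
# Moment-based skewness: third central moment / (second central moment)^(3/2)
skewness <- function(x) {
  deviations <- x - mean(x)
  mean(deviations^3) / mean(deviations^2)^(3 / 2)
}

set.seed(42)  # arbitrary seed for reproducibility
right_skewed <- rexp(10000, rate = 1)  # long right-hand tail
symmetric    <- rnorm(10000)           # symmetric around 0
left_skewed  <- -right_skewed          # mirror image: long left-hand tail

skewness(right_skewed)  # positive
skewness(symmetric)     # close to 0
skewness(left_skewed)   # negative
```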
Skewness II
[Figure: three panels titled "Positive Skew", "Symmetric Distribution", and "Negative Skew"]
Kurtosis I
Definition: Describes the evenness/"tailedness" of distributions relative to
the normal distribution (excess kurtosis).
Positive kurtosis: Heavy-tailed (long-tailed) distribution aka. leptokurtic
Kurtosis = 0: Base representation of a given distribution aka. mesokurtic
Negative kurtosis: Light-tailed (short-tailed) distribution aka. platykurtic
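Kurtosis can likewise be computed from its moment-based definition. The sketch uses the standard convention of excess kurtosis, in which the normal distribution scores 0, heavier tails give positive values and lighter tails give negative values. The helper excess_kurtosis() and the three simulated data sets are assumptions made for illustration.

```r
# Excess kurtosis: fourth central moment / (second central moment)^2, minus 3
# so that the normal distribution has a value of 0
excess_kurtosis <- function(x) {
  deviations <- x - mean(x)
  mean(deviations^4) / mean(deviations^2)^2 - 3
}

set.seed(42)  # arbitrary seed for reproducibility
heavy_tailed <- rexp(10000) - rexp(10000)  # Laplace-type data: heavier tails than normal
normal_data  <- rnorm(10000)               # reference case
light_tailed <- runif(10000)               # uniform data: lighter tails than normal

excess_kurtosis(heavy_tailed)  # positive (leptokurtic)
excess_kurtosis(normal_data)   # close to 0 (mesokurtic)
excess_kurtosis(light_tailed)  # negative (platykurtic)
```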
Kurtosis II