CLASSIFICATIONS
Order from Chaos
Erik Kusch
erik.kusch@au.dk
Section for Ecoinformatics & Biodiversity
Center for Biodiversity and Dynamics in a Changing World (BIOCHANGE)
Aarhus University
18/11/2020
1 Variables
Categorical Variables
Continuous Variables
Converting Variable Types
2 Classifications
Logistic Regression
K-Means
Hierarchies
Random Forests
Networks
Variables
Types of Variables
Variables can be classified into a multitude of types. The most common
classification system distinguishes:
Categorical Variables
also known as Qualitative Variables
Scales can be either:
Nominal
Ordinal
Continuous Variables
also known as Quantitative Variables
Scales can be either:
Discrete
Continuous
Variables Categorical Variables
Categorical Variables
Categorical variables are those variables which establish and fall into
distinct groups and classes.
Categorical variables:
can take on a finite number of values
assign each unit of the population to one of a finite number of groups
can sometimes be ordered
In R, categorical variables usually come up as object type factor or
character.
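A minimal, hypothetical sketch of these object types in R (values invented for illustration):
Soil_chr <- c("Sandy", "Mud", "Permafrost", "Sandy") # a character vector
Soil_fct <- factor(Soil_chr) # a factor with three levels
levels(Soil_fct) # the finite set of groups: "Mud", "Permafrost", "Sandy"
Stage_ord <- factor(c("Juvenile", "Mature", "Juvenile"),
                    levels = c("Juvenile", "Mature"),
                    ordered = TRUE) # an ordered factor for ordinal scales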
Variables Categorical Variables
Categorical Variables (Examples)
Examples of categorical variables:
Biome Classifications (e.g. "Boreal Forest", "Tundra", etc.)
Sex (e.g. "Male", "Female")
Hierarchy Position (e.g. "α-Individual", "β-Individual", etc.)
Soil Type (e.g. "Sandy", "Mud", "Permafrost", etc.)
Leaf Type (e.g. "Compound", "Single Blade", etc.)
Sexual Reproductive Stage (e.g. "Juvenile", "Mature", etc.)
Species Membership
Family Group Membership
...
Variables Continuous Variables
Continuous Variables
Continuous variables are those variables which establish a range of
possible data values.
Continuous variables:
can take on an infinite number of values
can take on a new value for each unit in the set-up
can always be ordered
In R, continuous variables usually come up as object type numeric.
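A minimal, hypothetical sketch (values invented for illustration):
Temps <- c(271.3, 284.9, 290.0) # temperatures in Kelvin
class(Temps) # "numeric"
sort(Temps) # continuous variables can always be ordered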
Variables Continuous Variables
Continuous Variables (Examples)
Examples of continuous variables:
Temperature
Precipitation
Weight
pH
Altitude
Group Size
Vegetation Indices
Time
...
Variables Converting Variable Types
Binning Variables
Continuous variables can be converted into categorical variables via a method
called binning:
Given a variable range, one can establish however many “bins” one wants.
For example:
Given a temperature range of 271K to 291K, there may be 4 bins of equal
size:
Bin A: 271K ≤ X ≤ 276K
Bin B: 276K < X ≤ 281K
Bin C: 281K < X ≤ 286K
Bin D: 286K < X ≤ 291K
Whilst a continuous variable can be both continuous and categorical,
a categorical variable can only ever be categorical!
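A minimal sketch of binning in R using base R’s cut() (hypothetical temperatures):
Temps <- c(272.3, 275.9, 280.1, 284.7, 290.4) # temperatures in Kelvin
Bins <- cut(Temps,
            breaks = seq(271, 291, by = 5), # bin edges at 271, 276, 281, 286, 291
            labels = c("A", "B", "C", "D"),
            include.lowest = TRUE) # closes Bin A on the left (271 <= X)
Bins # a factor, i.e. the categorical counterpart of our continuous variable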
Variables Converting Variable Types
Confusion Of Units
Classifications Logistic Regression
Theory
Logistic Regression
glm(..., family=binomial(link='logit')) in base R
Purpose: Understand how certain variables drive distinct outcomes.
Assumptions:
Down to Study-Design:
Variable values are independent (not paired)
Binary logistic regression: response variable is binary
Ordinal logistic regression: response variable is ordinal
Need for Post-Hoc Tests (see the sketch after this list):
Absence of influential outliers
Absence of multi-collinearity
Predictor variables and log odds are related in a linear
fashion
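A hedged sketch of such post-hoc checks, assuming a fitted model object called
Logistic_Mod (as built on the following slides) and the car package:
library(car) # access to vif()
vif(Logistic_Mod) # variance inflation factors; values well above ~5 hint at multi-collinearity
plot(Logistic_Mod, which = 4) # Cook's distance; flags potentially influential outliers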
Classifications Logistic Regression
Example - The Data
library(titanic)
titanic_df <- na.omit(titanic_train) # remove NA rows
set.seed(42)
Rows <- sample(1:dim(titanic_df)[1], 50, replace = FALSE)
test_df <- titanic_df[Rows,c(2,3,5,6)] # 50 rows for testing
train_df <- titanic_df[-Rows,c(2,3,5,6)] # remaining rows for training
head(train_df)
## Survived Pclass Sex Age
## 1 0 3 male 22
## 2 1 1 female 38
## 4 1 1 female 35
## 5 0 3 male 35
## 7 0 1 male 54
## 8 0 3 male 2
Can we explain survival (`Survived`) based on passenger class (`Pclass`), sex (`Sex`),
and age (`Age`)? Was it really "Women and children first"?
Classifications Logistic Regression
Example - The Model
Logistic_Mod <- glm(Survived ~., # use all variables in the data frame
family = binomial(link = 'logit'), # logistic
data = train_df # where to take the data from
)
summary(Logistic_Mod)[["coefficients"]]
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.11787 0.519980 9.842 7.390e-23
## Pclass -1.29567 0.143615 -9.022 1.850e-19
## Sexmale -2.45459 0.214837 -11.425 3.123e-30
## Age -0.03867 0.007937 -4.872 1.105e-06
Logistic regression coefficients can’t be interpreted the same way as regular linear model
coefficients: they are reported on the log-odds scale, whereas we are interested in survival
probabilities between 0 and 1.
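Exponentiating the coefficients yields odds ratios, which are easier to interpret (a
minimal sketch, assuming Logistic_Mod from above):
exp(coef(Logistic_Mod)) # odds ratios
# e.g. exp(-2.4546) is roughly 0.086: holding class and age constant, the odds of
# survival for men were about 8.6% of the odds for women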
Classifications Logistic Regression
Example - Explanation & Prediction
Clearly, women of a young age in first class had the highest survival probability.
How do we know this? As class increases (from 1 to 3), survival probability decreases (-1.2957).
Furthermore, men (Sexmale) had, on average, a much lower survival probability than women (-2.4546).
Lastly, increasing age negatively affected survival chances (-0.0387).
But how sure can we be of our model accuracy? We can test it by predicting on some new
data and validating our predictions:
# predict on test data
fitted <- predict(Logistic_Mod, newdata=test_df, type='response')
# if predicted survival probability above .5 assume survival
fitted <- ifelse(fitted > 0.5 , 1, 0)
# compare actual data with predictions --> ERROR RATE
mean(fitted != test_df$Survived)
## [1] 0.2
In reality, one would fine-tune the probability threshold above which to assume survival, as sketched below!
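A hedged sketch of such fine-tuning, assuming Logistic_Mod and test_df from above (in
practice, one would tune on a separate validation set rather than the test set):
probs <- predict(Logistic_Mod, newdata = test_df, type = "response")
Thresholds <- seq(0.1, 0.9, by = 0.05) # candidate cut-offs
Errors <- sapply(Thresholds, function(x) mean((probs > x) != test_df$Survived))
Thresholds[which.min(Errors)] # cut-off with the lowest error rate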
Classifications K-Means
Theory
K-Means Clustering
Mclust() in mclust package
Purpose: Identify a number of k clusters in our data.
Assumptions:
Variance of the distribution of each variable is spherical
All variables have the same variance
Prior probability for all k clusters is the same
`mclust` is capable of identifying the statistically most appropriate
number of clusters for the data set (a minimal sketch follows below).
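Called without the G argument, Mclust() compares models for 1 to 9 clusters and retains
the one with the best BIC. A minimal sketch, assuming the iris measurements used on the
following slides:
library(mclust)
Mclust_auto <- Mclust(iris[, -5]) # no G supplied; BIC selects the cluster number
Mclust_auto$G # the statistically most appropriate number of clusters
plot(Mclust_auto, what = "BIC") # BIC across candidate models and cluster numbers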
Classifications K-Means
Example - The Data I
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Can we accurately identify the `Species` contained within the data set by clustering
according to `Sepal.Length`, `Sepal.Width`, `Petal.Length`, and `Petal.Width`?
Here, we decide to limit the number of clusters to the number of species present so we can test
how well the prediction went.
Classifications K-Means
Example - The Data II
When building a training and test data set for identification of discrete values, we need to
identify data of each group in both data sets. We do so via stratified sampling.
library(splitstackshape) # access to the stratified function
set.seed(42) # make sampling reproducible
test_df <- stratified(indt = iris, # input data
group = "Species", # what the strata are
size = 7, # how many samples per strata
keep.rownames = TRUE) # keep the original rownames
training_df <- iris[-as.numeric(test_df$rn), ] # training data
Doing this ensures that we have data for each group, both to build a classifier and to test
the validity of our grouping.
Classifications K-Means
Example - The Model I
library(mclust)
Mclust_mod <- Mclust(training_df[,-5], # data for the cluster model
G = length(unique(training_df[,5]))) # group number
plot(Mclust_mod, what = "uncertainty")
[Pairwise scatterplot matrix of Sepal.Length, Sepal.Width, Petal.Length, and Petal.Width, showing the three fitted clusters and their classification uncertainty]
Classifications K-Means
Example - The Model II
Looking at the cluster centres and/or spreads can help with some biological
interpretation.
Mclust_means <- Mclust_mod[["parameters"]][["mean"]] # extract means
colnames(Mclust_means) <- unique(training_df$Species) # set columns
Mclust_means
## setosa versicolor virginica
## Sepal.Length 4.9907 6.052 6.696
## Sepal.Width 3.4302 2.811 2.974
## Petal.Length 1.4628 4.539 5.759
## Petal.Width 0.2535 1.521 2.024
I prefer a visualization as seen on the previous slide.
Classifications K-Means
Example - Explanation & Prediction
Clearly, Petal.Length and Petal.Width are extremely good separators for our different
clusters, with the green and red clusters (versicolor and virginica) overlapping a lot in
Sepal.Length and Sepal.Width space.
But how sure can we be of our model accuracy? We can test it by predicting the cluster
membership and validating our predictions against the real data:
Mclust_pred <- predict.Mclust(Mclust_mod, test_df[,-c(1,6)]) # prediction
fitted <- Mclust_pred$classification # predicted species number
# compare actual data with predictions --> ERROR RATE
mean(fitted != as.numeric(test_df$Species))
## [1] 0.09524
Classifications Hierarchies
Theory
Hierarchical Clustering
hclust() in base R or rpart() in rpart package and many others
Purpose: Build a decision tree for classification of our data.
Advantages:
Very easy to explain and interpret.
Easy to visualize.
Easily handle qualitative predictors without the need to
create dummy variables.
Disadvantages:
Very sensitive to the choice of linkage.
Generally do not have the same level of predictive
accuracy as some of the other regression and
classification approaches.
Trees can be very non-robust.
Classifications Hierarchies
Example - The Data I
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Again, let’s see if we can accurately identify the `Species` contained within the data set by
clustering according to `Sepal.Length`, `Sepal.Width`, `Petal.Length`, and `Petal.Width`.
Classifications Hierarchies
Example - The Data II & Model I
`hclust()` can only handle distance matrices.
We compute a distance matrix between the numeric components of our data like so:
dist_mat <- dist(iris[, -5])
A distance matrix stores information about the dissimilarity of different observations.
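A quick peek at it (a minimal sketch; by default, dist() computes Euclidean distances):
as.matrix(dist_mat)[1:3, 1:3] # pairwise distances between the first three observations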
Now, we can build our initial model:
clusters <- hclust(dist_mat, method = "complete")
Classifications Hierarchies
Example - The Model II
par(mfrow = c(1,3))
plot(clusters, main = "complete")
plot(hclust(dist_mat, method = "single"), main = "single")
plot(hclust(dist_mat, method = "average"), main = "average")
[Three dendrograms of the 150 iris observations built from dist_mat, using complete, single, and average linkage]
Classifications Hierarchies
Example - Explanation & Prediction
Hierarchical clustering recognises as many groups as there are observations, so we may
wish to prune the decision tree to a meaningful split level.
We know that we have three species in our data, so we cut the tree into three clusters;
the second argument of cutree() specifies the desired number of groups, and cutree()
finds the cut height at which the tree resolves into exactly that many clusters.
clusterCut <- cutree(clusters, 3) # cut tree
table(clusterCut, iris$Species) # assess fit
##
## clusterCut setosa versicolor virginica
## 1 50 0 0
## 2 0 23 49
## 3 0 27 1
As we can see here, our decision tree has had no issue isolating setosa in cluster 1.
However, it struggles to separate versicolor and virginica: cluster 2 lumps 23 versicolor
in with 49 virginica, while cluster 3 holds the remaining 27 versicolor (and 1 virginica).
Classifications Hierarchies
Example - Decisions
So far, we weren’t able to tell the actual decision rules by which our data get clustered. Let’s change that:
library(rpart)
fit <- rpart(Species ~. , data = iris)
plot(fit, margin = .1); text(fit, use.n = TRUE)
[Decision tree plot: the root splits on Petal.Length < 2.45 (left leaf: setosa); the remainder splits on Petal.Width < 1.75 into versicolor and virginica leaves, with per-class counts shown at each leaf]
We can tell that our decisions for assigning species membership build on Petal.Length and
Petal.Width in this example (remember the K-means clustering)!
Classifications Random Forests
Theory
Random Forests
tuneRF() in randomForest package
Purpose:
Identify which variables to use for clustering our data and
build a tree.
Advantages:
Extremely powerful.
Very robust.
Easy to interpret.
Disadvantages:
A black box algorithm.
Computationally expensive.
Classifications Random Forests
Example - The Data
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
One final time, we ask whether we can accurately identify the `Species` contained within
the data set by clustering according to `Sepal.Length`, `Sepal.Width`, `Petal.Length`, and
`Petal.Width`.
Classifications Random Forests
Example - The Model
library(randomForest)
set.seed(42) # set seed because the process is random
RF_Mod <- tuneRF(x = iris[,-5], # variables which to use for clustering
y = iris[,5], # correct cluster assignment
strata = iris[,5], # stratified sampling
doBest = TRUE, # run the best overall tree
ntreeTry = 20000, # consider this number of trees
improve = 0.0001, # improvement if this is exceeded
trace = FALSE, plot = FALSE)
## -0.1429 0.0001
## 0 0.0001
RF_Mod[["confusion"]]
## setosa versicolor virginica class.error
## setosa 50 0 0 0.00
## versicolor 0 47 3 0.06
## virginica 0 3 47 0.06
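A hedged sketch of prediction with the fitted forest; note that RF_Mod was trained on all
of iris above, so re-using the stratified test_df from the mclust example gives an
optimistic, in-sample check (a proper test would refit on training_df only):
RF_pred <- predict(RF_Mod, newdata = test_df[, 2:5]) # predicted species labels
mean(RF_pred != test_df$Species) # error rate on the held-out rows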
Classifications Random Forests
Example - Explanation
That is one stunningly accurate classification!
Let’s see which variables were actually the most useful when making our clustering decisions:
varImpPlot(RF_Mod)
[Variable importance plot for RF_Mod: Petal.Width and Petal.Length show by far the largest MeanDecreaseGini, Sepal.Length and Sepal.Width the smallest]
Classifications Networks
Theory
Network Clustering
cluster_optimal(), etc. in igraph package and many others
Purpose:
Identify compartments which are strongly connected within,
but not between, each other.
Advantages:
Highly flexible approaches.
Network analyses offer much more than clustering.
Allow for clustering of very different data and for
identifying relationships that other approaches cannot.
Disadvantages:
Steep learning curve.
Formatting the data correctly can be tricky.
Choices can become overwhelming.
Classifications Networks
Example - The Data
Here, we take a foodweb contained within the foodwebs data collection of the igraphdata
package. We are using the Middle Chesapeake Bay in Summer foodweb (Hagy, J.D. (2002) Eutrophication,
hypoxia and trophic transfer efficiency in Chesapeake Bay. PhD Dissertation, University of Maryland at College Park (USA), 446 pp.).
library(igraph)
library(igraphdata)
data("foodwebs")
Foodweb_ig <- foodwebs[[2]]
Let’s see what kind of network-internal clusters we can make out.
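A quick, hedged look at the network object before clustering:
vcount(Foodweb_ig) # number of nodes/vertices (species and compartments)
ecount(Foodweb_ig) # number of edges (feeding links)
is_directed(Foodweb_ig) # TRUE here; energy flows from prey to predator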
Classifications Networks
Example - A Directed Network
A directed network is one in which we know which node/vertex is acting on
which other node/vertex.
We identify the clusters as follows:
Clusters <- cluster_optimal(Foodweb_ig)
Colours <- Clusters$membership
Colours <- rainbow(max(Colours))[Colours]
plot(Foodweb_ig,
     vertex.color = Colours,
     vertex.size = degree(Foodweb_ig) * 0.5,
     layout = layout.grid, edge.arrow.size = 0.001)
This identifies sub-networks/clusters by
optimizing the modularity score of the
overall network (i.e. optimizing
connections within vs. between
clusters).
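A minimal sketch of retrieving the achieved modularity of this clustering (assuming
Clusters from above):
modularity(Clusters) # higher values indicate denser within-cluster connections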
[Grid-layout plot of the directed foodweb, vertices coloured by cluster membership and sized by degree]
Classifications Networks
Example - An Undirected Network
An undirected network is one in which we don’t know which node/vertex
is acting on which other node/vertex.
We identify the clusters as follows
(there are more options):
Foodweb_ig <- as.undirected(Foodweb_ig)
Clusters <- cluster_fast_greedy(Foodweb_ig)
Colours <- Clusters$membership
Colours <- rainbow(max(Colours))[Colours]
plot(Foodweb_ig,
     vertex.color = Colours,
     vertex.size = degree(Foodweb_ig) * 0.5,
     layout = layout.grid, edge.arrow.size = 0.001)
This identifies sub-networks/clusters by
optimizing the modularity score of the
overall network (i.e. optimizing
connections within vs. between
clusters).
[Grid-layout plot of the undirected foodweb, vertices coloured by cluster membership and sized by degree]