Descriptive Statistics

Theory

These are the solutions to the exercises contained within the handout to Descriptive Statistics, which walks you through the basics of descriptive statistics and its parameters. The analyses presented here use data from the starwars data set supplied through the dplyr package, which has been saved as a .csv file. Keep in mind that there are probably many other ways to reach the same conclusions as presented in these solutions. I have prepared some slides for this session.

Data

Find the data for this exercise here. Do not worry about downloading it for now.

Packages

As you will remember from our lecture slides, the calculation of the mode in R can either be achieved through some intense coding or simply by using the mlv(..., method="mfv") function contained within the modeest package (unfortunately, this package is out of date and can sometimes be challenging to install).
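For reference, the coding route is shorter than it may sound. The following is a minimal base-R sketch, not the modeest implementation; the function name most_frequent() and the example vector are made up for illustration.

```r
# Hypothetical helper: return the most frequent value(s) of a vector
most_frequent <- function(x) {
  x <- x[!is.na(x)]                               # drop missing values
  counts <- table(x)                              # frequency of each distinct value
  modes <- names(counts)[counts == max(counts)]   # keep all ties
  # table() stores values as character labels, so convert back for numbers
  if (is.numeric(x)) as.numeric(modes) else modes
}

# Made-up example values
example_vec <- c(3, 5, 5, 7, NA, 5, 3)
most_frequent(example_vec)
```

Unlike a single-number summary, this sketch returns every tied value when several values share the highest frequency.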

Consequently, it is now time for you to get familiar with how packages work in R. Packages are the way by which R is supplied with user-created and moderator-mediated functionality that exceeds the base applicability of R. Many things you will want to accomplish in more advanced statistics are impossible without such packages, and even basic tasks such as data visualisation (dealt with in our next seminar) rely on R packages.

If you want to get a package and its functions into R there are two ways we will discuss in the following. In general, it pays to load all packages at the beginning of a coding document before any actual analyses happen (in the preamble) so you get a good overview of what the program is calling upon.

Basic Preamble

This is the most basic version of getting packages into R and is widely practised and taught. Unsurprisingly, I am not a big fan of it.

First, you use the function install.packages() to download the desired package from dedicated servers (usually CRAN mirrors) to your machine, where it is unpacked into a library (a folder whose default location depends on your operating system; .libPaths() shows where it is). Secondly, you need to invoke the library() function to load the R package you need into your active R session. In our case of the package modeest it would look something like this:

install.packages("modeest")
library(modeest)

The reason I am not overly fond of this procedure is that it is clunky, can break easily through spelling mistakes and starts cluttering your preamble very quickly if the analyses you want to perform require a large number of packages. Additionally, when you are somewhere with a bad internet connection you might not want to re-download packages that are already on your hard drive.

Advanced Preamble

There is a myriad of different preamble styles (just as there are tons of different, personalised coding styles). I am left with presenting my preamble of choice here but I do not claim that this is the most sophisticated one out there.

The way this preamble works is that it is structured around a user-defined function (something we will touch on later in our seminar series) which first checks whether a package is already downloaded and then installs (if necessary) and/or loads it into R. This function is called install.load.package() and you can see its specification down below (don’t worry if it doesn’t make sense to you yet - it is not supposed to at this point). Unfortunately, it can only ever be applied to one package at a time and so we need a workaround to make it work on multiple packages at once. This can be achieved by establishing a vector of all desired package names (package_vec) and then applying (sapply()) the install.load.package() function to every item of the package name vector iteratively as follows:

# function to load packages and install them if they haven't been installed yet
install.load.package <- function(x) {
    if (!require(x, character.only = TRUE))
        install.packages(x)
    require(x, character.only = TRUE)
}
# packages to load/install if necessary
package_vec <- c("modeest")
# applying function install.load.package to all packages specified in
# package_vec
sapply(package_vec, install.load.package)
## Loading required package: modeest
## modeest 
##    TRUE

Why do I prefer this? Firstly, it is much shorter than the basic method when dealing with many packages (which will happen fast, I promise). Secondly, each package name only has to be typed once instead of twice, which halves the opportunities for typos. Finally, it does not re-install packages that are already installed, which speeds up your processing time.
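To see how the vector-based approach scales to several packages, here is a sketch that only checks which packages are available, without installing or attaching anything. The package names are placeholders: stats and utils ship with R, while the third name is deliberately made up. requireNamespace() is used here instead of require() so that nothing is attached to the session.

```r
# Placeholder package names: "stats" and "utils" ship with every R installation,
# "notARealPackage123" is a made-up name that should not exist
check_vec <- c("stats", "utils", "notARealPackage123")

# requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
is_available <- sapply(check_vec, requireNamespace, quietly = TRUE)
is_available

# names of the packages that would still need install.packages()
missing_pkgs <- names(is_available)[!is_available]
missing_pkgs
```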

Loading the data into R

Our data is located in the Data folder and is called DescriptiveData.csv. Since it is a .csv file, we can simply use the R in-built function read.csv() to load the data by combining the former two identifiers into one long string with a forward slash separating the two (the slash indicates a step down in the folder hierarchy). Given this argument, read.csv() will produce an object of type data.frame in R which we want to keep in our environment and hence need to assign a name to. In our case, let that name be Data_df (I recommend using endings to your data object names that indicate their type for easier coding without constant type checking):

# load the data file from the Data folder if you downloaded it as a .csv:
# Data_df <- read.csv("Data/DescriptiveData.csv")
# alternatively, read the .csv directly from the URL
Data_df <- read.csv("https://github.com/ErikKusch/Homepage/raw/master/content/courses/biostat101/Data/DescriptiveData.csv")
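
If you work from a local copy, the folder and file name can also be combined with file.path(), which inserts the separator for you. The sketch below is self-contained: it writes a tiny made-up table to a temporary folder (standing in for the Data folder) and reads it back. The file name ToyData.csv is an assumption for this demonstration only.

```r
# Temporary folder standing in for the "Data" folder (assumption for this demo)
data_dir <- file.path(tempdir(), "Data")
dir.create(data_dir, showWarnings = FALSE)

# A tiny made-up table, written to disk as a .csv file
toy_df <- data.frame(name = c("A", "B"), height = c(172L, 167L))
csv_path <- file.path(data_dir, "ToyData.csv")
write.csv(toy_df, csv_path, row.names = FALSE)

# Read it back; read.csv() returns a data.frame
Toy_df <- read.csv(csv_path)
class(Toy_df)
dim(Toy_df)
```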

What’s contained within our data?

Now that our data set is loaded into R, we can finally try to make sense of it. Usually, this shouldn’t be something one has to do in R but should be manageable through a project-/data-specific README file (we will cover this in our seminar on hypothesis testing and project planning), but for now we are stuck with pure exploration of our data set. Get your goggles on and let’s dive in!

Firstly, it always pays to assess the basic attributes of any data object (remember the Introduction to R seminar):

  • Name - we know the name (it is Data_df) since we named it that
  • Type - we already know that it is a data.frame because we created it using the read.csv function
  • Mode - this is an interesting one as it means having to subset our data frame
  • Dimensions - crucial information about how many observations and variables are contained within our data set

Dimensions

Let’s start with the dimensions because these will tell us how many modes (these are object attribute modes and not descriptive parameter modes) to assess:

dim(Data_df)
## [1] 87  8

Using the dim() function, we arrive at the conclusion that our Data_df contains 87 rows and 8 columns. Since data frames are usually ordered as observations $\times$ variables, we can conclude that we have 87 observations and 8 variables at our hands.
You can arrive at the same point by using the function View() in R. I’m not showing this here because it does not translate well to paper and would make whoever chooses to print this waste paper.

Modes

Now it’s time to get the hang of the modes of the variable records within our data set. To do so, we have two choices: we can either subset the data frame by columns and apply the class() function to each column subset, or simply apply the str() function to the data frame object. str() may be preferable here because it automatically breaks down the structure of R-internal objects and hence saves us the sub-setting:

str(Data_df)
## 'data.frame':	87 obs. of  8 variables:
##  $ name      : chr  "Luke Skywalker" "C-3PO" "R2-D2" "Darth Vader" ...
##  $ height    : int  172 167 96 202 150 178 165 97 183 182 ...
##  $ mass      : num  77 75 32 136 49 120 75 32 84 77 ...
##  $ hair_color: chr  "blond" "" "" "none" ...
##  $ skin_color: chr  "fair" "gold" "white, blue" "white" ...
##  $ eye_color : chr  "blue" "yellow" "red" "yellow" ...
##  $ birth_year: num  19 112 33 41.9 19 52 47 NA 24 57 ...
##  $ gender    : chr  "male" "" "" "male" ...

As it turns out, our data frame contains the 8 variables name, height, mass, hair_color, skin_color, eye_color, birth_year and gender, which are of integer, numeric and character mode (R versions before 4.0 would have converted the character columns to factors by default).
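
The first option mentioned above, checking the class of each column, can be sketched with sapply(). The small data frame here is a made-up stand-in so that the example runs on its own; with the real data you would pass Data_df instead.

```r
# Made-up stand-in for Data_df (three rows, three columns)
toy_df <- data.frame(
  name   = c("Luke Skywalker", "C-3PO", "R2-D2"),
  height = c(172L, 167L, 96L),
  mass   = c(77, 75, 32),
  stringsAsFactors = FALSE
)

# class of a single column
class(toy_df$height)

# class of every column at once
sapply(toy_df, class)
```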

Data Content

So what does our data actually tell us? Answering this question usually comes down to some analyses but for now we are only really interested in what kind of information our data frame is storing.

Again, this would be easiest to assess using a README file or the View() function in R. However, for the sake of brevity we can make do with the following two commands, which present the user with the first and last six rows of a data frame:

head(Data_df)
##             name height mass  hair_color  skin_color eye_color birth_year
## 1 Luke Skywalker    172   77       blond        fair      blue       19.0
## 2          C-3PO    167   75                    gold    yellow      112.0
## 3          R2-D2     96   32             white, blue       red       33.0
## 4    Darth Vader    202  136        none       white    yellow       41.9
## 5    Leia Organa    150   49       brown       light     brown       19.0
## 6      Owen Lars    178  120 brown, grey       light      blue       52.0
##   gender
## 1   male
## 2       
## 3       
## 4   male
## 5 female
## 6   male
tail(Data_df)
##                name height mass hair_color skin_color eye_color birth_year
## 82             Finn     NA   NA      black       dark      dark         NA
## 83              Rey     NA   NA      brown      light     hazel         NA
## 84      Poe Dameron     NA   NA      brown      light     brown         NA
## 85              BB8     NA   NA       none       none     black         NA
## 86   Captain Phasma     NA   NA    unknown    unknown   unknown         NA
## 87 Padm\xe9 Amidala    165   45      brown      light     brown         46
##    gender
## 82   male
## 83 female
## 84   male
## 85   none
## 86 female
## 87 female
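
By default, head() and tail() return six rows; the n argument changes that. A small sketch using a made-up data frame:

```r
# Made-up data frame with ten rows
toy_df <- data.frame(id = 1:10, value = (1:10)^2)

head(toy_df, n = 3)   # first three rows
tail(toy_df, n = 2)   # last two rows
nrow(head(toy_df))    # number of rows returned by default
```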

The avid reader will surely have picked up on the fact that all the records in the name column of Data_df belong to characters from the Star Wars universe. In fact, this data set is a modified version of the starwars data set supplied by the dplyr package and contains information on many of the important characters of the Star Wars movie universe.

Parameters of descriptive statistics

Names

As it turns out (and should’ve been obvious from the outset if we’re honest), every major character in the cinematic Star Wars Universe has a unique name. Consequently, calculating parameters of descriptive statistics makes no sense for the names of our characters, for the following two reasons:

  • The name variable is of mode character/factor and only allows for the calculation of the mode
  • Since every name only appears once, there is no mode

As far as the calculation of descriptive parameters of the name variable of our data set is concerned, Admiral Ackbar said it best: “It’s a trap”.
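
Whether every value in a vector really is unique can be checked quickly. The following sketch uses a made-up vector of names rather than the full name column:

```r
# Made-up character vector in which every entry is unique
toy_names <- c("Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader")

anyDuplicated(toy_names)   # 0 means no value occurs more than once
max(table(toy_names))      # highest frequency of any single name
```

If the highest frequency is 1, every value is tied for most frequent and reporting a mode is not informative.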

Height

Let’s get started on figuring out some parameters of descriptive statistics for the height variable of our Star Wars characters.

Subsetting

First, we need to extract the data in question from our big data frame object. This can be achieved by indexing through the column name as follows:

Height <- Data_df$height

Location Parameters

Now, with our Height vector being the numeric height records of the Star Wars characters in our data set, we are primed to calculate location parameters as follows:

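# Result names carry a prefix so they do not mask the base functions
# mean(), median(), mode(), min(), max() and range()
Height_mean <- mean(Height, na.rm = TRUE)
Height_median <- median(Height, na.rm = TRUE)
Height_mode <- mlv(na.omit(Height), method = "mfv")
Height_min <- min(Height, na.rm = TRUE)
Height_max <- max(Height, na.rm = TRUE)
Height_range <- Height_max - Height_min

# Combining all location parameters into one vector for easier viewing
LocPars_vec <- c(Height_mean, Height_median, Height_mode, Height_min, Height_max, Height_range)
names(LocPars_vec) <- c("mean", "median", "mode", "minimum", "maximum", "range")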
LocPars_vec
##    mean  median    mode minimum maximum   range 
## 174.358 180.000 183.000  66.000 264.000 198.000

As you can see, the heights of our Star Wars characters span a wide range, and the mean (174.4) sits below the median (180) because of the outliers at either end of the height distribution.

Distribution Parameters

Now that we are aware of the location parameters of the Star Wars height records, we can move on to the distribution parameters/parameters of spread. Those can be calculated in R as follows:

Height_var <- var(Height, na.rm = TRUE)
Height_sd <- sd(Height, na.rm = TRUE)
Height_quantiles <- quantile(Height, na.rm = TRUE)

# Combining all distribution parameters into one vector for easier viewing
DisPars_vec <- c(Height_var, Height_sd, Height_quantiles)
names(DisPars_vec) <- c("var", "sd", "0%", "25%", "50%", "75%", "100%")
DisPars_vec
##        var         sd         0%        25%        50%        75%       100% 
## 1208.98272   34.77043   66.00000  167.00000  180.00000  191.00000  264.00000

Notice how some of the quantiles (actually quartiles in this case) link up with the location parameters: the 0%, 50% and 100% quantiles are the minimum, median and maximum, respectively.
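
The relationship between the two spread parameters can be verified by hand: the sample variance is the sum of squared deviations from the mean divided by n - 1, and the standard deviation is its square root. A sketch on a made-up vector:

```r
# Made-up height values
x <- c(150, 165, 172, 178, 202)

n <- length(x)
var_manual <- sum((x - mean(x))^2) / (n - 1)  # sample variance
sd_manual  <- sqrt(var_manual)                # standard deviation

# both comparisons should report TRUE
all.equal(var_manual, var(x))
all.equal(sd_manual, sd(x))
```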

Summary

Just to round this off, have a look at what the summary() function in R supplies you with:

Height_summary <- summary(na.omit(Height))
Height_summary
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    66.0   167.0   180.0   174.4   191.0   264.0

This is a nice assortment of location and dispersion parameters.

Mass

Now let’s focus on the weight of our Star Wars characters.

Subsetting

Again, we need to extract our data from the data frame. The calculations are the same as for Height, so to save space I only present the results from here on.

Mass <- Data_df$mass
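
The omitted steps mirror the Height analysis. One way to avoid retyping them is to wrap the calculations in a function. The following is a sketch, not the code used to produce the output below: the name location_pars() is made up, and the mode is computed with base R rather than mlv() so that the example runs without modeest (in case of ties it returns only the smallest of the most frequent values).

```r
# Hypothetical helper returning the location parameters of a numeric vector
location_pars <- function(x) {
  x <- x[!is.na(x)]                                          # drop missing values once
  counts <- table(x)                                         # frequency of each value
  mode_val <- as.numeric(names(counts)[which.max(counts)])   # first most frequent value
  c(mean = mean(x), median = median(x), mode = mode_val,
    minimum = min(x), maximum = max(x), range = max(x) - min(x))
}

# Made-up example values (not the Star Wars records)
toy_pars <- location_pars(c(32, 49, 75, 75, 77, 84, NA, 120))
toy_pars
```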

Location Parameters

##       mean     median       mode    minimum    maximum      range 
##   97.31186   79.00000   80.00000   15.00000 1358.00000 1343.00000

As you can see, there is a huge range in the weight records of Star Wars characters, and the outlier on the upper end in particular (1358kg) pulls the mean upwards and away from the median. We’ve got Jabba Desilijic Tiure to thank for that.
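
How strongly a single extreme value affects the mean, while barely moving the median, can be seen in a small sketch. The values are made up, apart from the maximum of 1358 taken from the output above:

```r
# Made-up masses in kg
masses <- c(50, 60, 70, 80)
mean(masses)     # 65
median(masses)   # 65

# Adding a single extreme value
masses_out <- c(masses, 1358)
mean(masses_out)    # 323.6
median(masses_out)  # 70
```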

Distribution Parameters

##        var         sd         0%        25%        50%        75%       100% 
## 28715.7300   169.4572    15.0000    55.6000    79.0000    84.5000  1358.0000

Quite obviously, the wide range of weight records also prompts a large variance and standard deviation.

Hair Colour

Hair colour is saved in column 4 of our data set, so when sub-setting the data frame to obtain information about a character’s hair colour, instead of calling on Data_df$hair_color we can also index by column position:

HCs <- Data_df[, 4]
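
The different ways of indexing a column return the same vector, which the following sketch confirms on a made-up stand-in for Data_df. Indexing by name tends to be the more robust choice, since it keeps working if the column order changes.

```r
# Made-up stand-in for Data_df
toy_df <- data.frame(
  name       = c("A", "B", "C"),
  hair_color = c("brown", "none", "black"),
  stringsAsFactors = FALSE
)

by_name     <- toy_df$hair_color        # column name with $
by_position <- toy_df[, 2]              # column position
by_string   <- toy_df[["hair_color"]]   # column name as a string

identical(by_name, by_position)
identical(by_name, by_string)
```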

Of course, hair colour is not a numeric variable and is much better represented as a factor. Therefore, we are unable to obtain most parameters of descriptive statistics, but we can show a frequency count, which allows us to identify the mode:

table(HCs)
## HCs
##                      auburn  auburn, grey auburn, white         black 
##             5             1             1             1            13 
##         blond        blonde         brown   brown, grey          grey 
##             3             1            18             1             1 
##          none       unknown         white 
##            37             1             4
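
Rather than reading the mode off the table by eye, it can be extracted programmatically. A sketch on made-up hair colour records:

```r
# Made-up hair colour records
toy_hc <- c("brown", "none", "black", "none", "brown", "none")

hc_counts <- table(toy_hc)
sort(hc_counts, decreasing = TRUE)              # most frequent category first
hc_mode <- names(hc_counts)[hc_counts == max(hc_counts)]  # all ties are returned
hc_mode
```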

Eye Colour

Eye colour is another factor mode variable:

ECs <- Data_df$eye_color

We can only calculate the mode by looking for the maximum in our table() output:

table(ECs)
## ECs
##         black          blue     blue-gray         brown          dark 
##            10            19             1            21             1 
##          gold green, yellow         hazel        orange          pink 
##             1             1             3             8             1 
##           red     red, blue       unknown         white        yellow 
##             5             1             3             1            11

Birth Year

Subsetting

As another numeric variable, birth year allows for the calculation of the full range of parameters of descriptive statistics:

BY <- Data_df$birth_year

Keep in mind that Star Wars operates on a different time reference scale than we do.

Location Parameters

##      mean    median      mode   minimum   maximum     range 
##  87.56512  52.00000  19.00000   8.00000 896.00000 888.00000

Again, there is a big disparity here between mean and median which stems from extreme outliers on both ends of the age spectrum (Yoda and Wicket Systri Warrick, respectively).

Distribution Parameters

##        var         sd         0%        25%        50%        75%       100% 
## 23929.4414   154.6914     8.0000    35.0000    52.0000    72.0000   896.0000

Unsurprisingly, there is a big variance and standard deviation for the observed birth year/age records.

Gender

Gender is another factor mode variable (obviously):

Gender <- Data_df$gender

We can, again, only judge the mode of our data from the output of the table() function:

table(Gender)
## Gender
##                      female hermaphrodite          male          none 
##             3            19             1            62             2
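
The unlabelled first column in the table above counts records stored as empty strings. If you would rather treat those as missing values, one option is to recode them to NA before tabulating. A sketch on made-up values:

```r
# Made-up gender records containing empty strings
toy_gender <- c("male", "", "female", "male", "", "none")

toy_gender[toy_gender == ""] <- NA       # recode empty strings as missing
table(toy_gender)                        # NA is dropped by default
table(toy_gender, useNA = "ifany")       # NA shown as its own category
```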