Michał Wypych
Now that we have our dataset loaded we can finally get to work with it! We’ll start with the basics of data wrangling: subsetting datasets, sorting variables, changing variables and getting basic summaries. These are the standard things that you might want to do before any statistical modelling.
For data wrangling we'll be working within the `tidyverse` throughout this course. The `tidyverse` is a set of packages designed for working with data in a clean, readable way. A huge advantage (apart from readability) is that all packages in the `tidyverse` are designed to be compatible with each other and share a common "grammar" and way of doing things. This makes it easy to combine them to do a lot of different things with your data. We've already met one package from this collection: `readr`. Other packages include:
- `dplyr`: package for data wrangling. We'll focus on it in this class.
- `tidyr`: package for tidying and reshaping data. We'll look at it in the next class.
- `ggplot2`: the go-to package for data visualization in R. Absolutely the best of the best when it comes to plotting.
- `forcats`: package for working with factors.
- `stringr`: package for working with text data.
- `purrr`: functional programming tools for R, like easier iteration within the `tidyverse`.
- `tibble`: package that introduces slightly altered data frames.
The `tidyverse` is not the only way to wrangle data in R (another package often used is `data.table`, and a newer alternative is `polars`). If you don't want additional packages you don't even need them: you can do almost everything in base R if you want to. So why choose the `tidyverse`? First of all, it's extremely intuitive. Writing and reading `tidyverse` code feels almost like writing plain text describing what you want to do with your data. Thanks to pipes the code is also much more readable (more on that in a moment, although there are now pipe alternatives). One downside of the `tidyverse` is that it is significantly slower than these alternatives. If speed is paramount you might want to consider switching to `data.table` or `polars`.
So far if we wanted to use multiple functions in a single call we had to wrap one function inside another e.g. if we wanted to take a list of vectors, calculate the mean of each vector and then find out the highest mean we could do something like this:
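For example, computing the mean of each vector and then taking the largest of those means looks like this when written inside out:

```r
# means are 2, 5 and 7, so the largest is 7
max(sapply(list(c(1, 2, 3), c(4, 5, 6), c(6, 7, 8)), mean))
#> [1] 7
```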
It's not the easiest code to read, right? When combining functions this way you need to read them inside out. This is not how people read. It would be much easier if we could read code more linearly, e.g. from left to right and top to bottom. Enter the pipe! The pipe operator allows you to chain functions together in a readable way. The basic idea (there's more to pipes though) is to take what is on the left-hand side of the pipe and pass it as the first argument of whatever is on the right-hand side. This turns inside-out code into left-to-right code. There are 2 pipes in R. The first one comes from the `magrittr` package and is the one used in the `tidyverse`. This pipe looks like this: `%>%`. If we wanted to rewrite the code above using this pipe it would look like this:
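The same computation written with the `magrittr` pipe reads from left to right:

```r
library(magrittr)

list(c(1, 2, 3), c(4, 5, 6), c(6, 7, 8)) %>%
  sapply(mean) %>%
  max()
#> [1] 7
```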
It's much easier to understand what this code does, right? An alternative introduced in R 4.1 is the native pipe: `|>`. The basic functionality is pretty much the same as the `magrittr` pipe but you don't need to load any packages to use it (you might need to enable the native pipe in Global Options under the Tools menu in RStudio). The same code as above but with the native pipe looks like this:
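With the native pipe the only change is the operator itself:

```r
list(c(1, 2, 3), c(4, 5, 6), c(6, 7, 8)) |>
  sapply(mean) |>
  max()
#> [1] 7
```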
You might wonder why we have two kinds of pipes, one of which requires loading a package. The first reason is very simple: the `magrittr` pipe is older. There are however a few differences. You can read about the details here. Remember that the pipe automatically passes what is on the left as the first argument to whatever is on the right? What if you need to pass it not as the first but as the second or third argument? Both pipe operators have a placeholder for such situations: `%>%` has the `.` placeholder and `|>` has `_`. The difference between them is that `_` can only be used once and has to be used with a named argument. Here's an example of how the placeholder can work: `append()` allows you to join two vectors together. The vector passed as the second argument is appended to the one passed as the first argument:
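A small sketch with made-up vectors: to pipe a vector into `append()`'s second position, mark that position with the placeholder:

```r
x <- c(3, 4)

# magrittr: `.` marks where the piped value goes
# x %>% append(c(1, 2), .)        # same as append(c(1, 2), x)

# native pipe: `_` works similarly but must be bound to a named argument
x |> append(c(1, 2), values = _)  # same as append(c(1, 2), values = x)
#> [1] 1 2 3 4
```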
Generally, the differences boil down to simplicity: native pipe was deliberately created to be a simpler operator with less functionality. Most of the time you won’t notice much difference (maybe except for how the placeholder argument works).
Now we can get to the basics of data wrangling with the `dplyr` package. We'll look at the `storms` dataset from `dplyr`. It stores information on the date, location, status and some other attributes of storms from 1975 to 2021. The dataset contains multiple observations for each storm because measurements were made every few hours. Before we move on to working with the data, let's introduce one function: `glimpse()`. It's a bit like `str()` but a bit more readable for data frames. This function gives you a concise look at what variables you have in your dataset. Let's load the `tidyverse` and our dataset and look at it:
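The output below comes from loading the `tidyverse` and glimpsing `storms`:

```r
library(tidyverse)
glimpse(storms)
```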
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract() masks magrittr::extract()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ purrr::set_names() masks magrittr::set_names()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 19,537
Columns: 13
$ name <chr> "Amy", "Amy", "Amy", "Amy", "Amy", "Amy",…
$ year <dbl> 1975, 1975, 1975, 1975, 1975, 1975, 1975,…
$ month <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ day <int> 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 2…
$ hour <dbl> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18,…
$ lat <dbl> 27.5, 28.5, 29.5, 30.5, 31.5, 32.4, 33.3,…
$ long <dbl> -79.0, -79.0, -79.0, -79.0, -78.8, -78.7,…
$ status <fct> tropical depression, tropical depression,…
$ category <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ wind <int> 25, 25, 25, 25, 25, 25, 25, 30, 35, 40, 4…
$ pressure <int> 1013, 1013, 1013, 1013, 1012, 1012, 1011,…
$ tropicalstorm_force_diameter <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ hurricane_force_diameter <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
The dataset has 19,537 rows and 13 columns. We can also see that we have various types of variables: many numeric, one factor and one character.
You can subset a dataframe either by columns or by rows. If you want to extract a subset of rows based on some logical conditions you can use `filter()`. Let's say we want only storms from 2020:
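A call like this produces the output below:

```r
storms %>%
  filter(year == 2020)
```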
# A tibble: 863 × 13
name year month day hour lat long status category wind pressure
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Arthur 2020 5 16 18 28 -78.7 tropical … NA 30 1008
2 Arthur 2020 5 17 0 28.9 -78 tropical … NA 35 1006
3 Arthur 2020 5 17 6 29.6 -77.6 tropical … NA 35 1004
4 Arthur 2020 5 17 12 30.3 -77.5 tropical … NA 35 1003
5 Arthur 2020 5 17 18 31 -77.3 tropical … NA 40 1003
6 Arthur 2020 5 18 0 31.9 -77 tropical … NA 40 1003
7 Arthur 2020 5 18 6 33.1 -76.7 tropical … NA 40 1002
8 Arthur 2020 5 18 12 34.4 -75.9 tropical … NA 45 1000
9 Arthur 2020 5 18 18 35.5 -74.7 tropical … NA 45 993
10 Arthur 2020 5 19 0 36.2 -73.1 tropical … NA 50 991
# ℹ 853 more rows
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>
We can also filter based on multiple conditions. This works exactly like all the logical operations we've seen previously. One difference is that you can use `,` instead of `&`. Let's say we want to get all storms from June 2020:
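Both conditions go inside a single `filter()` call:

```r
storms %>%
  filter(year == 2020, month == 6)
```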
# A tibble: 57 × 13
name year month day hour lat long status category wind pressure
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Cristobal 2020 6 1 18 19.4 -90.9 tropic… NA 25 1006
2 Cristobal 2020 6 2 0 19.6 -91.6 tropic… NA 25 1005
3 Cristobal 2020 6 2 6 19.6 -92.1 tropic… NA 30 1005
4 Cristobal 2020 6 2 12 19.5 -92.5 tropic… NA 35 1004
5 Cristobal 2020 6 2 18 19.2 -92.6 tropic… NA 40 1001
6 Cristobal 2020 6 3 0 19 -92.5 tropic… NA 45 996
7 Cristobal 2020 6 3 6 18.9 -92.3 tropic… NA 50 994
8 Cristobal 2020 6 3 12 18.8 -92.2 tropic… NA 50 993
9 Cristobal 2020 6 3 13 18.7 -92.1 tropic… NA 50 993
10 Cristobal 2020 6 3 18 18.5 -91.9 tropic… NA 45 994
# ℹ 47 more rows
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>
There were 57 measurements of storms recorded in June 2020.
If you want to select only certain columns from a dataset you can use `select()`. E.g. if we want only the name, latitude and longitude of each storm we can do it like this:
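`select()` simply takes the column names:

```r
storms %>%
  select(name, lat, long)
```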
# A tibble: 19,537 × 3
name lat long
<chr> <dbl> <dbl>
1 Amy 27.5 -79
2 Amy 28.5 -79
3 Amy 29.5 -79
4 Amy 30.5 -79
5 Amy 31.5 -78.8
6 Amy 32.4 -78.7
7 Amy 33.3 -78
8 Amy 34 -77
9 Amy 34.4 -75.8
10 Amy 34 -74.8
# ℹ 19,527 more rows
If you want to select a range of columns you can use `:`:
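For example, all columns from `year` through `hour`:

```r
storms %>%
  select(year:hour)
```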
# A tibble: 19,537 × 4
year month day hour
<dbl> <dbl> <int> <dbl>
1 1975 6 27 0
2 1975 6 27 6
3 1975 6 27 12
4 1975 6 27 18
5 1975 6 28 0
6 1975 6 28 6
7 1975 6 28 12
8 1975 6 28 18
9 1975 6 29 0
10 1975 6 29 6
# ℹ 19,527 more rows
A particular situation in which you might want to subset a dataset is to get the rows with the highest/lowest values of a variable, get the first/last rows, or draw a random sample from the dataset. All of these can be achieved with different versions of `slice()`. `slice_sample()` draws a random sample from the dataset (either by number or by proportion of rows). You can also specify whether you want to draw with replacement:
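A sample of 100 rows like the one below can be drawn with:

```r
storms %>%
  slice_sample(n = 100)  # replace = FALSE by default
```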
# A tibble: 100 × 13
name year month day hour lat long status category wind pressure
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Erin 2019 8 28 6 32.2 -72.4 tropica… NA 35 1002
2 Alberto 2018 5 30 0 35.4 -87.6 tropica… NA 30 998
3 Gladys 1975 9 23 0 10.6 -35.8 tropica… NA 25 1012
4 Alberto 2006 6 10 6 20 -85 tropica… NA 25 1004
5 Florence 1988 9 8 0 22.6 -89.6 tropica… NA 45 993
6 Arlene 2017 4 22 18 31.9 -46.4 extratr… NA 30 1001
7 Erin 1995 8 4 6 32.3 -89.1 tropica… NA 35 1001
8 Rafael 2012 10 17 12 36.5 -60.1 hurrica… 1 65 972
9 Katrina 2005 8 26 6 25.4 -81.3 hurrica… 1 65 987
10 Cindy 1993 8 14 12 14.1 -59.5 tropica… NA 30 1012
# ℹ 90 more rows
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>
`slice_min()` and `slice_max()` allow you to get the rows with the lowest or highest values of a variable:
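A call like this would produce the output below (note that `slice_max()` keeps ties by default, which is why more than `n` rows can come back):

```r
storms %>%
  slice_max(wind, n = 10)
```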
# A tibble: 24 × 13
name year month day hour lat long status category wind pressure
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Allen 1980 8 7 18 21.8 -86.4 hurricane 5 165 899
2 Gilbert 1988 9 14 0 19.7 -83.8 hurricane 5 160 888
3 Wilma 2005 10 19 12 17.3 -82.8 hurricane 5 160 882
4 Dorian 2019 9 1 16 26.5 -77 hurricane 5 160 910
5 Dorian 2019 9 1 18 26.5 -77.1 hurricane 5 160 910
6 Allen 1980 8 5 12 15.9 -70.5 hurricane 5 155 932
7 Allen 1980 8 7 12 21 -84.8 hurricane 5 155 910
8 Allen 1980 8 8 0 22.2 -87.9 hurricane 5 155 920
9 Allen 1980 8 9 6 25 -94.2 hurricane 5 155 909
10 Gilbert 1988 9 14 6 19.9 -85.3 hurricane 5 155 889
# ℹ 14 more rows
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>
Finally, `slice_head()` and `slice_tail()` allow you to get the first or last n rows of a dataset:
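The first 5 rows shown below come from:

```r
storms %>%
  slice_head(n = 5)
```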
# A tibble: 5 × 13
name year month day hour lat long status category wind pressure
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Amy 1975 6 27 0 27.5 -79 tropical de… NA 25 1013
2 Amy 1975 6 27 6 28.5 -79 tropical de… NA 25 1013
3 Amy 1975 6 27 12 29.5 -79 tropical de… NA 25 1013
4 Amy 1975 6 27 18 30.5 -79 tropical de… NA 25 1013
5 Amy 1975 6 28 0 31.5 -78.8 tropical de… NA 25 1012
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>
One last thing about filtering. Sometimes you want to get all unique values of a variable. To do that you can use `distinct()`. By default it returns only the column with the distinct values and drops all the other columns from the dataframe. If you want to keep the other variables (though remember that it will keep only the first row for each unique value!) you can set the `.keep_all` argument to `TRUE`.
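For instance, to get the unique storm names:

```r
storms %>%
  distinct(name)

# keep the remaining columns (only the first row per name survives)
storms %>%
  distinct(name, .keep_all = TRUE)
```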
Sorting datasets by variables is super simple. You can use the `arrange()` function, and if you need to sort in descending order, use `desc()` inside it. Let's say we want to find the storm with the strongest wind:
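Sorting by wind speed in descending order puts the strongest measurement first:

```r
storms %>%
  arrange(desc(wind))
```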
# A tibble: 19,537 × 13
name year month day hour lat long status category wind pressure
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Allen 1980 8 7 18 21.8 -86.4 hurricane 5 165 899
2 Gilbert 1988 9 14 0 19.7 -83.8 hurricane 5 160 888
3 Wilma 2005 10 19 12 17.3 -82.8 hurricane 5 160 882
4 Dorian 2019 9 1 16 26.5 -77 hurricane 5 160 910
5 Dorian 2019 9 1 18 26.5 -77.1 hurricane 5 160 910
6 Allen 1980 8 5 12 15.9 -70.5 hurricane 5 155 932
7 Allen 1980 8 7 12 21 -84.8 hurricane 5 155 910
8 Allen 1980 8 8 0 22.2 -87.9 hurricane 5 155 920
9 Allen 1980 8 9 6 25 -94.2 hurricane 5 155 909
10 Gilbert 1988 9 14 6 19.9 -85.3 hurricane 5 155 889
# ℹ 19,527 more rows
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>
Looks like Allen from 1980 was the strongest storm.
Counting values of a variable is very simple: just use `count()`. One thing to remember is that `count()` returns a different dataframe. Unless you specify otherwise it will return a dataframe with 2 columns: one containing all unique values of the variable you counted, and another, named `n`, containing their counts (you can use the `name` argument to change that to something else). Setting the `sort` argument to `TRUE` sorts the counts in descending order. E.g. if we want to find out which year had the most measurements of storms (and not the number of storms! Remember that each row is 1 measurement of 1 storm) we can do it with:
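A call like this, with the most-measured year at the top:

```r
storms %>%
  count(year, sort = TRUE)
```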
A common task when wrangling data is creating new variables in an existing dataset. You can do it with `mutate()`. Let's say we want to create a new variable that stores whether the storm happened during summer (June, July, August) or not:
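The logical `summer` column in the output below can be created like this:

```r
storms %>%
  mutate(summer = month %in% c(6, 7, 8))
```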
# A tibble: 19,537 × 14
name year month day hour lat long status category wind pressure
<chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct> <dbl> <int> <int>
1 Amy 1975 6 27 0 27.5 -79 tropical d… NA 25 1013
2 Amy 1975 6 27 6 28.5 -79 tropical d… NA 25 1013
3 Amy 1975 6 27 12 29.5 -79 tropical d… NA 25 1013
4 Amy 1975 6 27 18 30.5 -79 tropical d… NA 25 1013
5 Amy 1975 6 28 0 31.5 -78.8 tropical d… NA 25 1012
6 Amy 1975 6 28 6 32.4 -78.7 tropical d… NA 25 1012
7 Amy 1975 6 28 12 33.3 -78 tropical d… NA 25 1011
8 Amy 1975 6 28 18 34 -77 tropical d… NA 30 1006
9 Amy 1975 6 29 0 34.4 -75.8 tropical s… NA 35 1004
10 Amy 1975 6 29 6 34 -74.8 tropical s… NA 40 1002
# ℹ 19,527 more rows
# ℹ 3 more variables: tropicalstorm_force_diameter <int>,
# hurricane_force_diameter <int>, summer <lgl>
Remember that if you want to keep the variable you need to assign the new dataset to an object.
Another extremely common task is getting summaries of our dataset. We can do it with the `summarise()` function. E.g. what if we want to see the mean and standard deviation of wind strength (for now across all measurements):
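Each argument to `summarise()` becomes a column in the result:

```r
storms %>%
  summarise(mean_wind = mean(wind), sd_wind = sd(wind))
```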
# A tibble: 1 × 2
mean_wind sd_wind
<dbl> <dbl>
1 50.0 25.5
Notice that the shape of the dataset has changed now. The columns are now the summaries and not the original variables.
So far we have been calculating things on entire datasets. In many situations you want to calculate something separately for each level of a categorical variable (much like with `tapply()` earlier). To group a dataset we can use `group_by()`. You can also group by multiple variables at once by separating them with commas: R will group by the first variable, then by the second, etc. Grouping is especially useful for creating summaries. E.g. if we want to get the average wind speed for each storm we can easily do it:
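Grouping by `name` before summarising gives one row per storm:

```r
storms %>%
  group_by(name) %>%
  summarise(mean_wind = mean(wind))
```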
# A tibble: 260 × 2
name mean_wind
<chr> <dbl>
1 AL011993 29.5
2 AL012000 25
3 AL021992 29
4 AL021994 24.2
5 AL021999 28.8
6 AL022000 29.2
7 AL022001 25
8 AL022003 30
9 AL022006 31.5
10 AL031987 21.2
# ℹ 250 more rows
Similarly if we want to calculate average wind speed from all measurements for each year and month:
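Here we simply group by two variables instead of one:

```r
storms %>%
  group_by(year, month) %>%
  summarise(mean_wind = mean(wind))
```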
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 259 × 3
# Groups: year [48]
year month mean_wind
<dbl> <dbl> <dbl>
1 1975 6 37.5
2 1975 7 49.7
3 1975 8 44.5
4 1975 9 55.9
5 1975 10 62.4
6 1976 8 55.9
7 1976 9 61.3
8 1976 10 51.0
9 1977 8 53
10 1977 9 50.6
# ℹ 249 more rows
Actually, the `summarise()` function has its own argument for calculating grouped summaries: you can specify the `.by` argument inside `summarise()`. If you use `group_by()` instead, then `summarise()` will by default drop the last level of grouping.
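A sketch of the `.by` version of the previous summary; unlike `group_by()`, it always returns an ungrouped dataframe:

```r
storms %>%
  summarise(mean_wind = mean(wind), .by = c(year, month))
```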
One more important thing about `group_by()` is that it works by adding an attribute to the dataframe. This means that after a `group_by()` all subsequent operations are conducted on the grouped dataframe. E.g. if you sort after grouping, the sorting is conducted within each group separately. You can drop the grouping with `ungroup()`. There is one last special kind of grouping you might want to use: sometimes you want to perform an operation separately on each row (e.g. calculate the average of a multi-item scale for each participant). You can do that with `rowwise()`.
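A sketch of `rowwise()` on hypothetical questionnaire data (the item names and values here are made up for illustration):

```r
# three scale items per participant
responses <- tibble(
  item_1 = c(1, 4, 5),
  item_2 = c(3, 5, 2),
  item_3 = c(2, 3, 4)
)

responses %>%
  rowwise() %>%
  mutate(scale_mean = mean(c(item_1, item_2, item_3))) %>%
  ungroup()
```

Without `rowwise()`, `mean(c(item_1, item_2, item_3))` would average all the values in all three columns at once instead of within each row.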
There are situations when you want to perform the same operation on multiple columns (e.g. calculate the mean and standard deviation of several variables). You can do it by hand but that can be tedious. To simplify it you can use `across()` inside `mutate()` or `summarise()`. The syntax of `across()` is as follows: the first argument, `.cols`, specifies which columns to operate on. The second argument, `.fns`, specifies which functions to apply; you provide it a list of functions (preferably a named list) or a formula. Finally, the `.names` argument specifies how to name the new variables. E.g. `"{.col}_{.fn}"` creates variables with the column name, an underscore and the function name (that's why a named list is useful here). If we want to get the mean and standard deviation of wind speed and pressure in each category of storms we can do it in a few lines of code (by the way, notice how `group_by()` by default includes `NA` as a separate category):
storms %>%
group_by(category) %>%
summarise(across(.cols = c("wind", "pressure"), .fns = list(mean = mean, sd = sd), .names = "{.col}_{.fn}"))
# A tibble: 6 × 5
category wind_mean wind_sd pressure_mean pressure_sd
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 71.0 5.55 981. 9.16
2 2 89.5 3.78 967. 8.92
3 3 104. 4.19 955. 8.88
4 4 122. 6.41 940. 9.34
5 5 146. 6.22 918. 12.1
6 NA 38.1 12.0 1002. 9.41
One more useful thing for data wrangling is a set of helper functions that make it easier to select multiple columns based on some condition. They do basically what their names suggest: `starts_with()`, `ends_with()` and `contains()`. This way you don't have to manually type all the names if they have something in common (e.g. items from the same scale named `scale_1`, `scale_2`, etc.). E.g. let's say we want to get all the columns that end with "diameter":
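The helper goes straight inside `select()`:

```r
storms %>%
  select(ends_with("diameter"))
```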
# A tibble: 19,537 × 2
tropicalstorm_force_diameter hurricane_force_diameter
<int> <int>
1 NA NA
2 NA NA
3 NA NA
4 NA NA
5 NA NA
6 NA NA
7 NA NA
8 NA NA
9 NA NA
10 NA NA
# ℹ 19,527 more rows
Nice! These helpers also work nicely inside `across()`. One thing to be aware of: if no column matches what you ask for, you won't get an error but a dataframe with all the rows and 0 columns:
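For example, no column in `storms` starts with "diameter" (they only end with it), so this silently returns 19,537 rows and 0 columns:

```r
storms %>%
  select(starts_with("diameter"))
```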
Where the `tidyverse` really shines is in combining multiple functions together with pipes. By chaining the functions described above in different orders we can get a ton of things out of our dataset. This already gives us the ability to answer a number of questions that might be very interesting for analysis.
Example 1: Let's say we want to find the name of the storm in each category that had the highest average pressure in 1989.
storms %>%
filter(year == 1989) %>%
group_by(category, name) %>%
summarise(mean_pressure = mean(pressure, na.rm = T)) %>%
slice_max(mean_pressure, n = 1)
`summarise()` has grouped output by 'category'. You can override using the
`.groups` argument.
# A tibble: 6 × 3
# Groups: category [6]
category name mean_pressure
<dbl> <chr> <dbl>
1 1 Chantal 987
2 2 Dean 972.
3 3 Hugo 953.
4 4 Gabrielle 943.
5 5 Hugo 918
6 NA Barry 1012.
Example 2: Let's say we want to find the average wind speed at each hour of the day, but in kilometers per hour rather than knots (as stored in the database). 1 knot is around 1.852 km/h.
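One way to sketch this: convert the units with `mutate()`, then group and summarise:

```r
storms %>%
  mutate(wind_kmh = wind * 1.852) %>%  # 1 knot is about 1.852 km/h
  group_by(hour) %>%
  summarise(mean_wind_kmh = mean(wind_kmh))
```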
Throughout the exercises you'll still work with the `storms` dataset.
Find out which year had the most storms (remember that each row is 1 measurement of 1 storm: you have to find the year with the most storms, not the most measurements!)
Find the average wind speed for each storm in 2018 and then sort from the highest to the lowest.
Calculate the mean and standard deviation of measurement pressure for each month of the year