


Author

Michał Wypych

Data wrangling

Now that we have our dataset loaded we can finally get to work with it! We’ll start with the basics of data wrangling: subsetting datasets, sorting variables, changing variables and getting basic summaries. These are the standard things that you might want to do before any statistical modelling.

Tidyverse

For data wrangling we’ll be working within tidyverse throughout this course. Tidyverse is a set of packages designed for working with data in a clean, readable way. A huge advantage (apart from readability) is that all packages in tidyverse are designed to be compatible with each other and share common “grammar” and way of doing things. This way it is easy to combine them to do a lot of different things with your data. We’ve already met one package from this collection: readr. Other packages include:

  • dplyr: package for data wrangling. We’ll focus on it in this class

  • tidyr: package for tidying data and reshaping it. We’ll look at it in the next class

  • ggplot2: the go-to package for data visualization in R. Absolutely the best of the best when it comes to plotting.

  • forcats: package for working with factors

  • stringr: package for working with text data

  • purrr: functional programming stuff in R like easier iteration within tidyverse

  • tibble: package that introduces slightly altered data frames

Tidyverse is not the only way to wrangle data in R (another commonly used package is data.table, and a newer alternative is polars). If you don’t want additional packages you don’t even need them: you can do almost everything in base R if you want to. So why choose tidyverse? First of all, it’s extremely intuitive: writing and reading tidyverse code feels almost like writing plain text describing what you want to do with your data. Pipes also make the code much more readable (more on that in a moment, although there are now pipe alternatives). One downside of tidyverse is that it is significantly slower than some alternatives. If speed is paramount you might want to consider switching to data.table or polars.

The pipe

So far if we wanted to use multiple functions in a single call we had to wrap one function inside another e.g. if we wanted to take a list of vectors, calculate the mean of each vector and then find out the highest mean we could do something like this:

max(sapply(list(c(1,2,3), c(4,5,6), c(6,7,8)), mean))
[1] 7

It’s not the easiest code to read, right? When combining functions this way you need to read them inside out. This is not how people read. It would be much easier if we could read code more linearly, e.g. from left to right and top to bottom. Enter the pipe! The pipe operator allows you to chain functions together in a readable way. The basic idea (there’s more to pipes though) is to take what is on the left-hand side of the pipe and pass it as the first argument of whatever is on the right-hand side. This turns inside-out code into left-to-right code. There are 2 pipes in R. The first one comes from the magrittr package and is the one used in tidyverse. This pipe looks like this: %>%. If we wanted to rewrite the code above using this pipe it would look like this:

library(magrittr)

list(c(1,2,3), c(4,5,6), c(6,7,8)) %>%
  sapply(mean) %>%
  max()
[1] 7

It’s much easier to understand what this code does, right? An alternative introduced in R 4.1 is the native pipe: |>. The basic functionality is pretty much the same as the magrittr pipe but you don’t need to load any packages to use it (in RStudio you may need to enable it under Tools > Global Options > Code). The same code as above but with the native pipe looks like this:

list(c(1,2,3), c(4,5,6), c(6,7,8)) |>
  sapply(mean) |>
  max()
[1] 7

You might wonder why there are two kinds of pipes, one of which requires loading a package. The first reason is very simple: the magrittr pipe is older. There are however a few differences. You can read about the details here. Remember that the pipe automatically passes what is on the left as the first argument to whatever is on the right? What if you need to pass it not as the first but as the second or third argument? Both pipe operators have a placeholder for such situations: %>% uses . and |> uses _. The difference between them is that _ can only be used once and has to be passed to a named argument. Here’s an example of how the placeholder can work: append() allows you to join two vectors together. The vector passed as the second argument is appended to the one passed as the first argument:

x <- c(1,2,3)
y <- c(4,5,6)
x %>%
  append(y, .)
[1] 4 5 6 1 2 3

Generally, the differences boil down to simplicity: the native pipe was deliberately created as a simpler operator with less functionality. Most of the time you won’t notice much difference (except perhaps for how the placeholder works).
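For comparison, here is the append() example from above rewritten with the native pipe’s placeholder. This is a sketch assuming R ≥ 4.2, where _ can be bound to a named argument:

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)

# `_` must appear exactly once, and only as a named argument
x |> append(x = y, values = _)
#> [1] 4 5 6 1 2 3
```

Note that because _ has to be named, you need to know the argument names of the function you are piping into (here append()’s x and values).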

The basic dplyr functions

Now we can get to the basics of data wrangling with the dplyr package. We’ll look at the storms dataset from dplyr. It stores information on the date, place, status and some other characteristics of storms from 1975 to 2022. The dataset contains multiple observations per storm because measurements were made every few hours. Before we move on to working with the data, let’s introduce one function: glimpse(). It’s a bit like str() but more readable for dataframes, giving you a concise look at what variables you have in your dataset. Let’s load tidyverse and our dataset and look at it:

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ tidyr::extract()   masks magrittr::extract()
✖ dplyr::filter()    masks stats::filter()
✖ dplyr::lag()       masks stats::lag()
✖ purrr::set_names() masks magrittr::set_names()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data("storms")

glimpse(storms)
Rows: 19,537
Columns: 13
$ name                         <chr> "Amy", "Amy", "Amy", "Amy", "Amy", "Amy",…
$ year                         <dbl> 1975, 1975, 1975, 1975, 1975, 1975, 1975,…
$ month                        <dbl> 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6,…
$ day                          <int> 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 2…
$ hour                         <dbl> 0, 6, 12, 18, 0, 6, 12, 18, 0, 6, 12, 18,…
$ lat                          <dbl> 27.5, 28.5, 29.5, 30.5, 31.5, 32.4, 33.3,…
$ long                         <dbl> -79.0, -79.0, -79.0, -79.0, -78.8, -78.7,…
$ status                       <fct> tropical depression, tropical depression,…
$ category                     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ wind                         <int> 25, 25, 25, 25, 25, 25, 25, 30, 35, 40, 4…
$ pressure                     <int> 1013, 1013, 1013, 1013, 1012, 1012, 1011,…
$ tropicalstorm_force_diameter <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ hurricane_force_diameter     <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

The dataset has 19,537 rows and 13 columns. We can also see that we have various types of variables: many numeric, one factor and one character.

Subsetting

You can subset a dataframe either by columns or by rows. If you want to extract a subset of rows based on some logical condition you can use filter(). Let’s say we want only storms from 2020:

storms %>%
  filter(year == 2020)
# A tibble: 863 × 13
   name    year month   day  hour   lat  long status     category  wind pressure
   <chr>  <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>         <dbl> <int>    <int>
 1 Arthur  2020     5    16    18  28   -78.7 tropical …       NA    30     1008
 2 Arthur  2020     5    17     0  28.9 -78   tropical …       NA    35     1006
 3 Arthur  2020     5    17     6  29.6 -77.6 tropical …       NA    35     1004
 4 Arthur  2020     5    17    12  30.3 -77.5 tropical …       NA    35     1003
 5 Arthur  2020     5    17    18  31   -77.3 tropical …       NA    40     1003
 6 Arthur  2020     5    18     0  31.9 -77   tropical …       NA    40     1003
 7 Arthur  2020     5    18     6  33.1 -76.7 tropical …       NA    40     1002
 8 Arthur  2020     5    18    12  34.4 -75.9 tropical …       NA    45     1000
 9 Arthur  2020     5    18    18  35.5 -74.7 tropical …       NA    45      993
10 Arthur  2020     5    19     0  36.2 -73.1 tropical …       NA    50      991
# ℹ 853 more rows
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>

We can also filter based on multiple conditions. This works exactly like all the logical operations we’ve seen previously. One difference is that inside filter() you can use , instead of &. Let’s say we want to get all measurements from June 2020:

storms %>%
  filter(year == 2020, month == 6)
# A tibble: 57 × 13
   name       year month   day  hour   lat  long status  category  wind pressure
   <chr>     <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>      <dbl> <int>    <int>
 1 Cristobal  2020     6     1    18  19.4 -90.9 tropic…       NA    25     1006
 2 Cristobal  2020     6     2     0  19.6 -91.6 tropic…       NA    25     1005
 3 Cristobal  2020     6     2     6  19.6 -92.1 tropic…       NA    30     1005
 4 Cristobal  2020     6     2    12  19.5 -92.5 tropic…       NA    35     1004
 5 Cristobal  2020     6     2    18  19.2 -92.6 tropic…       NA    40     1001
 6 Cristobal  2020     6     3     0  19   -92.5 tropic…       NA    45      996
 7 Cristobal  2020     6     3     6  18.9 -92.3 tropic…       NA    50      994
 8 Cristobal  2020     6     3    12  18.8 -92.2 tropic…       NA    50      993
 9 Cristobal  2020     6     3    13  18.7 -92.1 tropic…       NA    50      993
10 Cristobal  2020     6     3    18  18.5 -91.9 tropic…       NA    45      994
# ℹ 47 more rows
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>

There were 57 measurements recorded in June 2020.
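The other logical operators work inside filter() too. For instance, a sketch of getting all 2020 measurements from the summer months with %in%, which is equivalent to chaining == comparisons with |:

```r
library(dplyr)

# month %in% c(6, 7, 8) is TRUE when month is any of 6, 7 or 8
storms %>%
  filter(year == 2020, month %in% c(6, 7, 8))
```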

If you want to select only certain columns from a dataset you can use select(). E.g. if we want only the name, latitude and longitude of the storm we can do it like this:

storms %>%
  select(name, lat, long)
# A tibble: 19,537 × 3
   name    lat  long
   <chr> <dbl> <dbl>
 1 Amy    27.5 -79  
 2 Amy    28.5 -79  
 3 Amy    29.5 -79  
 4 Amy    30.5 -79  
 5 Amy    31.5 -78.8
 6 Amy    32.4 -78.7
 7 Amy    33.3 -78  
 8 Amy    34   -77  
 9 Amy    34.4 -75.8
10 Amy    34   -74.8
# ℹ 19,527 more rows

If you want to select a range of adjacent columns you can use the : operator:

storms %>%
  select(year:hour)
# A tibble: 19,537 × 4
    year month   day  hour
   <dbl> <dbl> <int> <dbl>
 1  1975     6    27     0
 2  1975     6    27     6
 3  1975     6    27    12
 4  1975     6    27    18
 5  1975     6    28     0
 6  1975     6    28     6
 7  1975     6    28    12
 8  1975     6    28    18
 9  1975     6    29     0
10  1975     6    29     6
# ℹ 19,527 more rows

A particular situation in which you might want to subset a dataset is to get the rows with the highest or lowest values of a variable, get the first or last rows, or draw a random sample from the dataset. All of these can be achieved with different variants of slice(). slice_sample() will draw a random sample from the dataset (either a number or a proportion of rows). You can also specify whether to draw with replacement:

storms %>%
  slice_sample(n = 100)
# A tibble: 100 × 13
   name      year month   day  hour   lat  long status   category  wind pressure
   <chr>    <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>       <dbl> <int>    <int>
 1 Chantal   2001     8    20     6  17.1 -84.1 tropica…       NA    50     1008
 2 Igor      2010     9    22    18  57   -50   extratr…       NA    65      955
 3 Hortense  1996     9    12    12  23.9 -71.8 hurrica…        4   115      959
 4 Chantal   2013     7    10    12  16.7 -71.2 tropica…       NA    40     1011
 5 Nadine    2012     9    13    12  22   -51.7 tropica…       NA    60      990
 6 Vicky     2020     9    18     6  20.5 -40.2 other l…       NA    25     1009
 7 Jose      2017     9    19     0  34.9 -71.4 hurrica…        1    65      972
 8 Chantal   2019     8    22     0  39.6 -48.3 tropica…       NA    30     1007
 9 Danielle  1998     8    27    18  21.8 -60.1 hurrica…        2    85      994
10 Ian       2022     9    30    18  33.3 -79.2 hurrica…        1    70      978
# ℹ 90 more rows
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>

slice_min() and slice_max() allow you to get the rows with the lowest or highest values of a variable, respectively.

storms %>%
  slice_max(wind, n = 10)
# A tibble: 24 × 13
   name     year month   day  hour   lat  long status    category  wind pressure
   <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>        <dbl> <int>    <int>
 1 Allen    1980     8     7    18  21.8 -86.4 hurricane        5   165      899
 2 Gilbert  1988     9    14     0  19.7 -83.8 hurricane        5   160      888
 3 Wilma    2005    10    19    12  17.3 -82.8 hurricane        5   160      882
 4 Dorian   2019     9     1    16  26.5 -77   hurricane        5   160      910
 5 Dorian   2019     9     1    18  26.5 -77.1 hurricane        5   160      910
 6 Allen    1980     8     5    12  15.9 -70.5 hurricane        5   155      932
 7 Allen    1980     8     7    12  21   -84.8 hurricane        5   155      910
 8 Allen    1980     8     8     0  22.2 -87.9 hurricane        5   155      920
 9 Allen    1980     8     9     6  25   -94.2 hurricane        5   155      909
10 Gilbert  1988     9    14     6  19.9 -85.3 hurricane        5   155      889
# ℹ 14 more rows
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>
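slice_min() works analogously for the lowest values; for instance, a sketch of pulling the measurements with the lowest pressure:

```r
library(dplyr)

# rows with the 5 lowest pressure values (ties may return extra rows)
storms %>%
  slice_min(pressure, n = 5)
```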

Finally, slice_head() and slice_tail() allow you to get the first or last n rows of a dataset.

storms %>%
  slice_head(n = 5)
# A tibble: 5 × 13
  name   year month   day  hour   lat  long status       category  wind pressure
  <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>           <dbl> <int>    <int>
1 Amy    1975     6    27     0  27.5 -79   tropical de…       NA    25     1013
2 Amy    1975     6    27     6  28.5 -79   tropical de…       NA    25     1013
3 Amy    1975     6    27    12  29.5 -79   tropical de…       NA    25     1013
4 Amy    1975     6    27    18  30.5 -79   tropical de…       NA    25     1013
5 Amy    1975     6    28     0  31.5 -78.8 tropical de…       NA    25     1012
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>

One last thing about filtering. Sometimes you want to get all the unique values of a variable. To do this you can use distinct(). By default it will return only the column with the distinct values and drop all the other columns from the dataframe. If you want to keep the other variables (though remember that it will keep only the first entry for each unique value!) you can set the .keep_all argument to TRUE.

storms %>%
  distinct(category)
# A tibble: 6 × 1
  category
     <dbl>
1       NA
2        1
3        3
4        2
5        4
6        5
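A sketch of the .keep_all behaviour mentioned above, keeping the first full row for each unique category:

```r
library(dplyr)

# .keep_all = TRUE keeps the remaining columns,
# retaining the first row encountered for each unique category
storms %>%
  distinct(category, .keep_all = TRUE)
```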

Sorting

Sorting datasets based on variables is super simple. You can use the arrange() function, and if you need to sort in descending order use desc() inside it. Let’s say we want to find the storm with the strongest wind:

storms %>%
  arrange(desc(wind))
# A tibble: 19,537 × 13
   name     year month   day  hour   lat  long status    category  wind pressure
   <chr>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>        <dbl> <int>    <int>
 1 Allen    1980     8     7    18  21.8 -86.4 hurricane        5   165      899
 2 Gilbert  1988     9    14     0  19.7 -83.8 hurricane        5   160      888
 3 Wilma    2005    10    19    12  17.3 -82.8 hurricane        5   160      882
 4 Dorian   2019     9     1    16  26.5 -77   hurricane        5   160      910
 5 Dorian   2019     9     1    18  26.5 -77.1 hurricane        5   160      910
 6 Allen    1980     8     5    12  15.9 -70.5 hurricane        5   155      932
 7 Allen    1980     8     7    12  21   -84.8 hurricane        5   155      910
 8 Allen    1980     8     8     0  22.2 -87.9 hurricane        5   155      920
 9 Allen    1980     8     9     6  25   -94.2 hurricane        5   155      909
10 Gilbert  1988     9    14     6  19.9 -85.3 hurricane        5   155      889
# ℹ 19,527 more rows
# ℹ 2 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>

Looks like Allen from 1980 was the strongest storm.

Counting

Counting values of a variable is very simple: just use count(). One thing to remember is that count() returns a different dataframe. Unless you specify anything additional it will have 2 columns: one containing all unique values of the variable you counted and another, named n, containing their counts (you can set the name argument to change this). Setting the sort argument to TRUE will sort the counts in descending order. E.g. if we want to find out which year had the most measurements of storms (and not the number of storms! Remember that each row is 1 measurement of 1 storm) we can do it with:

storms %>%
  count(year, sort = TRUE)
# A tibble: 48 × 2
    year     n
   <dbl> <int>
 1  2005   873
 2  2020   863
 3  1995   762
 4  2010   663
 5  2012   654
 6  2017   610
 7  2018   608
 8  2004   604
 9  2003   593
10  2021   592
# ℹ 38 more rows
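As a sketch of the name argument mentioned above, here is the same count with a more descriptive column name instead of n:

```r
library(dplyr)

storms %>%
  count(year, name = "n_measurements", sort = TRUE)
```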

Changing variables

A common task when wrangling data is creating new variables in an already existing dataset. You can do this with mutate(). Let’s say we want to create a new variable that stores whether the storm occurred during summer (June, July, August) or not:

storms %>%
  mutate(summer = ifelse(month == 6 | month == 7 | month == 8, TRUE, FALSE))
# A tibble: 19,537 × 14
   name   year month   day  hour   lat  long status      category  wind pressure
   <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <fct>          <dbl> <int>    <int>
 1 Amy    1975     6    27     0  27.5 -79   tropical d…       NA    25     1013
 2 Amy    1975     6    27     6  28.5 -79   tropical d…       NA    25     1013
 3 Amy    1975     6    27    12  29.5 -79   tropical d…       NA    25     1013
 4 Amy    1975     6    27    18  30.5 -79   tropical d…       NA    25     1013
 5 Amy    1975     6    28     0  31.5 -78.8 tropical d…       NA    25     1012
 6 Amy    1975     6    28     6  32.4 -78.7 tropical d…       NA    25     1012
 7 Amy    1975     6    28    12  33.3 -78   tropical d…       NA    25     1011
 8 Amy    1975     6    28    18  34   -77   tropical d…       NA    30     1006
 9 Amy    1975     6    29     0  34.4 -75.8 tropical s…       NA    35     1004
10 Amy    1975     6    29     6  34   -74.8 tropical s…       NA    40     1002
# ℹ 19,527 more rows
# ℹ 3 more variables: tropicalstorm_force_diameter <int>,
#   hurricane_force_diameter <int>, summer <lgl>

Remember that if you want to keep the new variable you need to assign the result to an object.
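A slightly more compact way to build the same summer variable is the %in% operator, which checks membership in a set of values and already returns TRUE or FALSE:

```r
library(dplyr)

storms %>%
  mutate(summer = month %in% c(6, 7, 8))
```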

Summaries

Another extremely common task is getting summaries of our dataset. We can do this with the summarise() function. E.g. what if we want to see the mean and standard deviation of wind speed (for now across all measurements):

storms %>%
  summarise(mean_wind = mean(wind, na.rm = T),
            sd_wind = sd(wind, na.rm = T))
# A tibble: 1 × 2
  mean_wind sd_wind
      <dbl>   <dbl>
1      50.0    25.5

Notice that the shape of the dataset has changed now. The columns are now the summaries and not the original variables.

Grouping rows

So far we have been calculating things on entire datasets. In many situations you want to calculate something separately for each level of a categorical variable (much like tapply() earlier). To group a dataset we can use group_by(). You can also group by multiple variables at once by separating them with a comma. R will group by the first variable, then by the second, etc. This is especially useful for creating summaries. E.g. if we want to get the average wind speed for each storm we can easily do it:

storms %>%
  group_by(name) %>%
  summarise(mean_wind = mean(wind, na.rm = T))
# A tibble: 260 × 2
   name     mean_wind
   <chr>        <dbl>
 1 AL011993      29.5
 2 AL012000      25  
 3 AL021992      29  
 4 AL021994      24.2
 5 AL021999      28.8
 6 AL022000      29.2
 7 AL022001      25  
 8 AL022003      30  
 9 AL022006      31.5
10 AL031987      21.2
# ℹ 250 more rows

Similarly if we want to calculate average wind speed from all measurements for each year and month:

storms %>%
  group_by(year, month) %>%
  summarise(mean_wind = mean(wind, na.rm = T))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
# A tibble: 259 × 3
# Groups:   year [48]
    year month mean_wind
   <dbl> <dbl>     <dbl>
 1  1975     6      37.5
 2  1975     7      49.7
 3  1975     8      44.5
 4  1975     9      55.9
 5  1975    10      62.4
 6  1976     8      55.9
 7  1976     9      61.3
 8  1976    10      51.0
 9  1977     8      53  
10  1977     9      50.6
# ℹ 249 more rows

Actually, summarise() has its own argument for calculating grouped summaries: you can specify the .by argument inside the summarise() call. If you use group_by() instead, summarise() will by default drop the last level of grouping from the result.
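A sketch of what the .by version of the earlier per-storm summary might look like (available since dplyr 1.1.0); unlike group_by(), it applies only to this one call and the result comes back ungrouped:

```r
library(dplyr)

# groups by name just for this summarise(); no ungroup() needed afterwards
storms %>%
  summarise(mean_wind = mean(wind, na.rm = TRUE), .by = name)
```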

One more important thing about group_by() is that it works by adding an attribute to the dataframe. This means that after a group_by() all subsequent operations are conducted on the grouped dataframe; e.g. if you sort after grouping, sorting will happen within each group separately. You can drop the grouping with ungroup(). There is one last special kind of grouping you might want to use: sometimes you want to perform an operation separately on each row (e.g. calculate the average of a multi-item scale for each participant). You can do this with rowwise().
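A minimal sketch of rowwise() on a toy dataset; item_1 to item_3 are hypothetical scale items made up for illustration:

```r
library(dplyr)

# hypothetical scale items, two participants
scale_data <- tibble(
  item_1 = c(1, 4),
  item_2 = c(2, 5),
  item_3 = c(3, 6)
)

# mean() is computed within each row, then grouping is dropped
scale_data %>%
  rowwise() %>%
  mutate(scale_mean = mean(c(item_1, item_2, item_3))) %>%
  ungroup()
```

Without rowwise(), mean() would pool all six values and return the same number in every row.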

Calculating across columns

There are situations when you want to perform the same operation on multiple columns (e.g. calculate the mean and standard deviation of several variables). You can do it by hand but this can be tedious. To simplify it you can use across() inside mutate() or summarise(). The syntax of across() is as follows: the first argument, .cols, specifies which columns to operate on. The second argument, .fns, specifies which functions to apply; you provide a list of functions (preferably a named list) or a formula. Finally, the .names argument specifies how to name the new variables, e.g. “{.col}_{.fn}” will build names from the column name, an underscore and the function name (that’s why a named list is useful here). If we want the mean and standard deviation of wind speed and pressure in each category of storms we can do it in a few lines of code (by the way, notice how group_by() by default includes NA as a separate category):

storms %>%
  group_by(category) %>%
  summarise(across(.cols = c("wind", "pressure"), .fns = list(mean = mean, sd = sd), .names = "{.col}_{.fn}"))
# A tibble: 6 × 5
  category wind_mean wind_sd pressure_mean pressure_sd
     <dbl>     <dbl>   <dbl>         <dbl>       <dbl>
1        1      71.0    5.55          981.        9.16
2        2      89.5    3.78          967.        8.92
3        3     104.     4.19          955.        8.88
4        4     122.     6.41          940.        9.34
5        5     146.     6.22          918.       12.1 
6       NA      38.1   12.0          1002.        9.41

Helper verbs

One more useful thing for data wrangling is a set of functions that make it easier to select multiple columns based on some condition. These helper verbs do basically what their names suggest: starts_with(), ends_with() and contains(). This way you don’t have to manually type all the names if they have something in common (e.g. they are items from the same scale named scale_1, scale_2 etc.). E.g. let’s say we want to get all the columns that end with “diameter”:

storms %>%
  select(ends_with("diameter"))
# A tibble: 19,537 × 2
   tropicalstorm_force_diameter hurricane_force_diameter
                          <int>                    <int>
 1                           NA                       NA
 2                           NA                       NA
 3                           NA                       NA
 4                           NA                       NA
 5                           NA                       NA
 6                           NA                       NA
 7                           NA                       NA
 8                           NA                       NA
 9                           NA                       NA
10                           NA                       NA
# ℹ 19,527 more rows

Nice! These verbs also work nicely inside across(). One thing to be aware of: if no column matches what you ask for you won’t get an error but a dataframe with all the rows but 0 columns:

storms %>%
  select(contains("some_weird_name"))
# A tibble: 19,537 × 0
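For instance, a sketch of combining ends_with() with across() to summarise every diameter column at once:

```r
library(dplyr)

# one mean per column whose name ends with "diameter"
storms %>%
  summarise(across(ends_with("diameter"), \(x) mean(x, na.rm = TRUE)))
```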

Combining functions together

Where tidyverse really shines is in combining multiple functions together with pipes. By ordering the functions described above in different ways we can get a ton of things out of our dataset. This already lets us answer a number of questions that might be very interesting for analysis.

  • Example 1: Let’s say we want to find out the name of the storm from each category that had the highest average pressure in 1989.

    storms %>%
      filter(year == 1989) %>%
      group_by(category, name) %>%
      summarise(mean_pressure = mean(pressure, na.rm = T)) %>%
      slice_max(mean_pressure, n = 1)
    `summarise()` has grouped output by 'category'. You can override using the
    `.groups` argument.
    # A tibble: 6 × 3
    # Groups:   category [6]
      category name      mean_pressure
         <dbl> <chr>             <dbl>
    1        1 Chantal            987 
    2        2 Dean               972.
    3        3 Hugo               953.
    4        4 Gabrielle          943.
    5        5 Hugo               918 
    6       NA Barry             1012.
  • Example 2: Let’s say we want to find the average wind speed at each hour of the day, but in kilometers per hour rather than knots (as stored in the dataset). 1 knot is around 1.852 km/h.

    storms %>%
      mutate(wind_km = wind*1.852) %>%
      group_by(hour) %>%
      summarise(mean_wind_km = mean(wind_km, na.rm = T))

Exercises

Throughout the exercises you’ll keep working with the storms dataset.

  • Find out which year had the most storms (remember that each row is 1 measurement of 1 storm. You have to find the year with the most storms and not the most measurements!)

  • Find the average wind speed for each storm in 2018 and then sort from the highest to the lowest.

  • Calculate the mean and standard deviation of measurement pressure for each month of the year