
Restructuring and joining data

Author

Michał Wypych

Restructuring data

We know a bit about wrangling data. Now we will deal with situations in which your data is not in the right shape to calculate what you want. We will look at separating and uniting variables (e.g. what if your dataset has separate columns for year, month and day but you need them in 1 variable?), joining datasets (e.g. what if some information you need is in 1 dataset but the rest is in another?) and reshaping data (going from wide to long format and back again).

We’ll learn separating and uniting variables by looking at a dataset about Scooby Doo episodes by plummye. The dataset is taken from Kaggle. It has a lot of variables that are either generally about a given episode or about specific characters (like monsters or members of the Scooby gang).

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
scooby <- read_csv("data/scooby.csv")
New names:
• `` -> `...1`
Rows: 603 Columns: 76
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (60): series_name, network, season, title, imdb, engagement, format, mo...
dbl   (6): ...1, index, run_time, monster_amount, suspects_amount, culprit_a...
lgl   (9): unmask_other, caught_other, caught_not, door_gag, batman, scooby_...
date  (1): date_aired
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Separating and uniting variables

In order to separate 1 variable into more you can use separate(). The opposite operation can be done with unite(). When separating you need to specify which variable to split, what the names of the new variables are (passed as a character vector) and what the separator is, which basically tells R where to “cut” the old variable into new ones. By default the old variable is removed from the dataset. Let’s look at the date_aired variable. It stores the year, month and day of when each episode was aired. Let’s say we wanted to split it into three variables: year, month and day. We can easily do it with separate():

scooby_separated <- scooby %>% separate(date_aired, into = c("year", "month", "day"), sep = "-")

scooby_separated %>%
  select(year, month, day) %>%
  head()
# A tibble: 6 × 3
  year  month day  
  <chr> <chr> <chr>
1 1969  09    13   
2 1969  09    20   
3 1969  09    27   
4 1969  10    04   
5 1969  10    11   
6 1969  10    18   

Yay, we got 3 new variables just like we wanted! Uniting is very similar, it just has the reverse effect. You specify what the name of the new variable should be, which variables to unite and what R should use to separate the values from the old variables.

#unite example
scooby_united <- unite(scooby_separated, col = "date_aired", year:day, sep = "-")

head(scooby_united$date_aired)
[1] "1969-09-13" "1969-09-20" "1969-09-27" "1969-10-04" "1969-10-11"
[6] "1969-10-18"

A potential problem with separate() arises when rows have different numbers of values, because then it is unclear how many columns to create. For example the monster_type column stores information on the types of monsters that were present in a given episode. The problem is that various episodes had different numbers of monsters and they’re all stored in a single column, separated by commas. In order to use separate() we need to know how many columns to create. We can find out by taking the maximum number of commas and adding 1. We can do it quickly with the str_count() function from the stringr package, which counts occurrences of a string:

max(str_count(scooby$monster_type, ","))
[1] 18

Ok, the maximum number of commas is 18 so there was a maximum of 19 monsters in a given episode. Now we need to prepare the names for the new variables (we’d rather not type 19 names by hand) and then separate:

a <- "monster_type_"
b <- c(1:19)
vars <- paste0(a, b)

scooby_separated <- separate(scooby, monster_type, into = vars,  sep = ",")
Warning: Expected 19 pieces. Missing pieces filled with `NA` in 602 rows [1, 2, 3, 4, 5,
6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
scooby_separated %>%
  select(monster_type_1:monster_type_19) %>%
  head()
# A tibble: 6 × 19
  monster_type_1   monster_type_2 monster_type_3 monster_type_4 monster_type_5
  <chr>            <chr>          <chr>          <chr>          <chr>         
1 Possessed Object <NA>           <NA>           <NA>           <NA>          
2 Ghost            <NA>           <NA>           <NA>           <NA>          
3 Ghost            <NA>           <NA>           <NA>           <NA>          
4 Ancient          <NA>           <NA>           <NA>           <NA>          
5 Ancient          <NA>           <NA>           <NA>           <NA>          
6 Ghost            <NA>           <NA>           <NA>           <NA>          
# ℹ 14 more variables: monster_type_6 <chr>, monster_type_7 <chr>,
#   monster_type_8 <chr>, monster_type_9 <chr>, monster_type_10 <chr>,
#   monster_type_11 <chr>, monster_type_12 <chr>, monster_type_13 <chr>,
#   monster_type_14 <chr>, monster_type_15 <chr>, monster_type_16 <chr>,
#   monster_type_17 <chr>, monster_type_18 <chr>, monster_type_19 <chr>

Notice the warning: for rows where there were not enough values to fill all 19 variables, the rest were filled with missing values.
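If you expect rows to have varying numbers of pieces, you can also silence this warning by telling separate() explicitly how to handle the missing pieces via its fill argument. A small sketch on made-up toy data (not the Scooby dataset):

```r
library(tidyr)

# toy data: rows have a different number of comma-separated values
toy <- tibble::tibble(monsters = c("Ghost,Ancient", "Ghost"))

# fill = "right" pads missing pieces with NA on the right, without a warning
separate(toy, monsters, into = c("monster_1", "monster_2"),
         sep = ",", fill = "right")
```

There is also fill = "left", which pads from the other side, and an analogous extra argument for rows with too many pieces.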

An alternative way to separate values is to create new rows rather than columns. This way you avoid having to know the number of columns to create in advance. You can do it with separate_rows(). It will split the values of a given column into multiple rows and duplicate all the values from the other columns:

scooby_separated <- separate_rows(scooby, monster_subtype, sep = ",")
nrow(scooby_separated)
[1] 1148

We end up with a data frame that has 1148 rows. Each row now corresponds to 1 monster per episode. If we count rows per episode (the title variable) and extract the highest value we should get 19:

count(scooby_separated, title) %>%
  slice_max(n, n = 1)
# A tibble: 1 × 2
  title                                      n
  <chr>                                  <int>
1 Scooby-Doo! and the Reluctant Werewolf    19

Joining data

In many situations the information that you need is not stored in a single dataset but in multiple ones. For example you might be working on a longitudinal study and each wave is saved in a separate dataset. Another common situation in which data is stored in multiple files (or tables) is to reduce redundancies. Imagine you store information about book authors and the books they published. You have some information about each author like the date of birth, nationality, awards etc. and some information about each book like the title, release date and genre. If you wanted to store it all in one table keeping 1 book per row would probably be most natural. However, then you would need to duplicate information about authors for every book they published. If the dataset is big this might prove to be a real issue. It might be easier to keep 1 table with author information and another table with book information which would also include 1 column to map books to authors. When working on both datasets you might want to join these two tables.
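To make the two-table setup described above concrete, here is a minimal sketch (the authors and titles are made up for illustration):

```r
library(dplyr)

# one row per author: author information is stored only once
authors <- tibble(author_id = c(1, 2),
                  author    = c("Ursula K. Le Guin", "Stanisław Lem"))

# one row per book, with a column mapping each book to its author
books <- tibble(author_id = c(1, 1, 2),
                title     = c("The Dispossessed",
                              "The Left Hand of Darkness",
                              "Solaris"))

# joining recreates the single wide table whenever we need it
left_join(books, authors, by = "author_id")
```

Notice that the author information is duplicated only in the joined result, not in storage.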

To look at joining data we’ll use 3 datasets that contain information about United Nations Roll calls. The data comes from Harvard Dataverse.

issues <- read_csv("data/issues.csv")
Rows: 5745 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): short_name, issue
dbl (1): rcid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
roll_calls <- read_csv("data/roll_calls.csv")
Rows: 6202 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): unres, short, descr
dbl  (5): rcid, session, importantvote, amend, para
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
votes <- read_csv("data/unvotes.csv")
Rows: 869937 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): country, country_code, vote
dbl (1): rcid

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The first dataset, issues, stores the roll call id in the rcid variable (what we will be joining on) and the name of the issue a given roll call was about:

glimpse(issues)
Rows: 5,745
Columns: 3
$ rcid       <dbl> 77, 9001, 9002, 9003, 9004, 9005, 9006, 128, 129, 130, 131,…
$ short_name <chr> "me", "me", "me", "me", "me", "me", "me", "me", "me", "me",…
$ issue      <chr> "Palestinian conflict", "Palestinian conflict", "Palestinia…

The second dataset contains information about specific roll calls like dates and descriptions:

glimpse(roll_calls)
Rows: 6,202
Columns: 9
$ rcid          <dbl> 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,…
$ session       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
$ importantvote <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ date          <date> 1946-01-01, 1946-01-02, 1946-01-04, 1946-01-04, 1946-01…
$ unres         <chr> "R/1/66", "R/1/79", "R/1/98", "R/1/107", "R/1/295", "R/1…
$ amend         <dbl> 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,…
$ para          <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,…
$ short         <chr> "AMENDMENTS, RULES OF PROCEDURE", "SECURITY COUNCIL ELEC…
$ descr         <chr> "TO ADOPT A CUBAN AMENDMENT TO THE UK PROPOSAL REFERRING…

Finally, the votes dataset contains roll call ids and information about how each country voted (yes, no or abstain):

glimpse(votes)
Rows: 869,937
Columns: 4
$ rcid         <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
$ country      <chr> "United States", "Canada", "Cuba", "Haiti", "Dominican Re…
$ country_code <chr> "US", "CA", "CU", "HT", "DO", "MX", "GT", "HN", "SV", "NI…
$ vote         <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "y…

Imagine you want to analyze how many times each country voted in a specific way on each issue. In order to do that we need to join the issues and votes data frames. All join functions in dplyr end with _join. They differ in what is kept or removed from the dataset when joining. In many situations not all records in the two datasets will match (e.g. there is dropout between the first and second wave of a longitudinal study, so not every row in wave 1 will have a matching row in wave 2). You can deal with this in 3 ways: keep only the matching rows (the inner_join() function), keep all rows from one dataset and drop non-matching rows from the other (the left_join() and right_join() functions) or keep all rows from both datasets (the full_join() function). When performing a join you can also specify a by argument that controls which columns should be used for the join. If you don’t specify this argument R will automatically try to join on all columns that have the same names in both datasets.
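The differences between these joins are easiest to see on toy data (the ids and values below are made up):

```r
library(dplyr)

x <- tibble(id = c(1, 2), a = c("x1", "x2"))
y <- tibble(id = c(2, 3), b = c("y2", "y3"))

inner_join(x, y, by = "id")  # only id 2 matches -> 1 row
left_join(x, y, by = "id")   # ids 1 and 2 kept, b is NA for id 1
full_join(x, y, by = "id")   # all ids (1, 2, 3) kept -> 3 rows
```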

Let’s try and answer our question about how countries voted on specific issues:

votes %>%
  inner_join(issues, by = "rcid") %>%
  group_by(issue, country) %>%
  count(vote)
Warning in inner_join(., issues, by = "rcid"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 382 of `x` matches multiple rows in `y`.
ℹ Row 3009 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
# A tibble: 3,469 × 4
# Groups:   issue, country [1,195]
   issue                        country     vote        n
   <chr>                        <chr>       <chr>   <int>
 1 Arms control and disarmament Afghanistan abstain    69
 2 Arms control and disarmament Afghanistan no         19
 3 Arms control and disarmament Afghanistan yes       939
 4 Arms control and disarmament Albania     abstain   114
 5 Arms control and disarmament Albania     no        168
 6 Arms control and disarmament Albania     yes       468
 7 Arms control and disarmament Algeria     abstain   109
 8 Arms control and disarmament Algeria     no          2
 9 Arms control and disarmament Algeria     yes       929
10 Arms control and disarmament Andorra     abstain   112
# ℹ 3,459 more rows

Notice that we got a warning about a many-to-many relationship. That’s because in the votes dataset multiple rows have the same rcid value (multiple countries voted in the same roll calls) and in the issues dataset certain roll calls have more than one issue.

Apart from the classic joins there are also filtering joins. They don’t really join datasets; instead they filter one dataset, keeping the rows that match (semi_join()) or don’t match (anti_join()) the other. Using them we can e.g. find out which roll calls have no information on their issues:

roll_calls %>%
  anti_join(issues, by = "rcid")
# A tibble: 2,103 × 9
    rcid session importantvote date       unres   amend  para short        descr
   <dbl>   <dbl>         <dbl> <date>     <chr>   <dbl> <dbl> <chr>        <chr>
 1     3       1             0 1946-01-01 R/1/66      1     0 AMENDMENTS,… "TO …
 2     4       1             0 1946-01-02 R/1/79      0     0 SECURITY CO… "TO …
 3     5       1             0 1946-01-04 R/1/98      0     0 VOTING PROC… "TO …
 4     7       1             0 1946-01-02 R/1/295     1     0 GENERAL ASS… "TO …
 5     9       1             0 1946-02-05 R/1/329     0     0 POST-WAR RE… "TO …
 6    10       1             0 1946-02-05 R/1/361     1     1 U.N. MEMBER… "TO …
 7    12       1             0 1946-02-06 R/1/394     1     1 COUNCIL MEM… "TO …
 8    13       1             0 1946-02-01 R/1/434     1     1 PROPAGANDA,… "TO …
 9    14       1             0 1946-02-01 R/1/435     1     1 PERSONNEL, … "TO …
10    15       1             0 1946-02-01 R/1/435     1     1 QUISLINGS, … "TO …
# ℹ 2,093 more rows

One final note on joins: their behaviour might sometimes feel a bit unintuitive when dealing with duplicated values. Imagine two data frames like the ones below. What do you think will happen when we make an inner join on the id variable?

table1 <- data.frame(id = c(1, 1),
                     a = c(1, 2))
table2 <- data.frame(id = c(1, 1),
                     b = c(3, 4))

table1
  id a
1  1 1
2  1 2
table2
  id b
1  1 3
2  1 4

What we get is a many-to-many mapping just like with the unvotes join we made:

table1 %>%
  inner_join(table2)
Joining with `by = join_by(id)`
Warning in inner_join(., table2): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 1 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
  id a b
1  1 1 3
2  1 1 4
3  1 2 3
4  1 2 4

R automatically joins every matching row from table1 with every matching row from table2. This might seem confusing but it is actually quite desirable behaviour. Remember the example with 2 tables on books: one with authors and one with books? When joining, we want to match each author to every book they have written, not just the first or last one. If you want more control you can use the multiple or relationship arguments.
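For example, if the many-to-many match is intentional, you can state it explicitly and the warning goes away. A sketch using the same toy tables as above, recreated here so the chunk is self-contained:

```r
library(dplyr)

table1 <- data.frame(id = c(1, 1), a = c(1, 2))
table2 <- data.frame(id = c(1, 1), b = c(3, 4))

# declaring the relationship documents our intent and silences the warning
inner_join(table1, table2, by = "id", relationship = "many-to-many")
```

The relationship argument (available since dplyr 1.1) also accepts values like "one-to-one" and "one-to-many", which turn unexpected duplicates into errors instead of silent row multiplication.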

From wide to long format and back again

The last thing we’ll cover here is changing the format of your data from wide to long or vice versa. For many datasets there are 2 ways in which you can store the same information.

In wide format you generally store 1 observation per row (e.g. one participant across all the waves of a study). In long format one row is one measurement for one observation (e.g. 1 wave for 1 participant). Let’s look at the votes data frame. If we are interested in roll calls it’s currently in long format: each row is 1 vote in a given roll call for one country. In wide format we could have 1 row per roll call and 1 column per country.

You can reshape a dataset into wide format using pivot_wider(). You need to specify 3 things: which column to take the values for the new columns from (values_from), which column to take the new variable names from (names_from) and a set of columns that uniquely identify each observation (id_cols). The last argument is needed so that we end up with 1 value per cell. We can reshape the votes data frame like this:

votes_wide <- votes %>%
  pivot_wider(id_cols = "rcid", names_from = "country", values_from = "vote")

votes_wide
# A tibble: 6,202 × 201
    rcid `United States` Canada Cuba  Haiti `Dominican Republic` Mexico
   <dbl> <chr>           <chr>  <chr> <chr> <chr>                <chr> 
 1     3 yes             no     yes   yes   yes                  yes   
 2     4 no              no     no    no    no                   no    
 3     5 no              no     yes   no    no                   yes   
 4     6 no              no     yes   <NA>  abstain              yes   
 5     7 no              no     yes   yes   yes                  yes   
 6     8 no              yes    yes   <NA>  yes                  yes   
 7     9 yes             yes    yes   yes   yes                  yes   
 8    10 yes             yes    yes   yes   yes                  yes   
 9    11 yes             yes    yes   <NA>  yes                  <NA>  
10    12 yes             yes    yes   <NA>  yes                  no    
# ℹ 6,192 more rows
# ℹ 194 more variables: Guatemala <chr>, Honduras <chr>, `El Salvador` <chr>,
#   Nicaragua <chr>, `Costa Rica` <chr>, Panama <chr>, Colombia <chr>,
#   Venezuela <chr>, Ecuador <chr>, Peru <chr>, Brazil <chr>, Bolivia <chr>,
#   Paraguay <chr>, Chile <chr>, Argentina <chr>, Uruguay <chr>,
#   `United Kingdom` <chr>, Netherlands <chr>, Belgium <chr>, Luxembourg <chr>,
#   France <chr>, Poland <chr>, Czechoslovakia <chr>, Yugoslavia <chr>, …

You could reshape the data in a different way so that each country is in 1 row and each column is a separate roll call:

votes_wide2 <- votes %>%
  pivot_wider(id_cols = "country", names_from = "rcid", values_from = "vote")

votes_wide2
# A tibble: 200 × 6,203
   country     `3`   `4`   `5`   `6`   `7`   `8`   `9`   `10`  `11`  `12`  `13` 
   <chr>       <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
 1 United Sta… yes   no    no    no    no    no    yes   yes   yes   yes   no   
 2 Canada      no    no    no    no    no    yes   yes   yes   yes   yes   no   
 3 Cuba        yes   no    yes   yes   yes   yes   yes   yes   yes   yes   no   
 4 Haiti       yes   no    no    <NA>  yes   <NA>  yes   yes   <NA>  <NA>  <NA> 
 5 Dominican … yes   no    no    abst… yes   yes   yes   yes   yes   yes   no   
 6 Mexico      yes   no    yes   yes   yes   yes   yes   yes   <NA>  no    <NA> 
 7 Guatemala   yes   no    no    no    <NA>  yes   <NA>  yes   <NA>  <NA>  <NA> 
 8 Honduras    yes   no    yes   yes   yes   <NA>  <NA>  <NA>  yes   yes   no   
 9 El Salvador yes   no    yes   abst… <NA>  <NA>  yes   no    <NA>  <NA>  no   
10 Nicaragua   yes   no    yes   yes   yes   <NA>  <NA>  no    <NA>  <NA>  <NA> 
# ℹ 190 more rows
# ℹ 6,191 more variables: `14` <chr>, `15` <chr>, `16` <chr>, `17` <chr>,
#   `18` <chr>, `19` <chr>, `20` <chr>, `21` <chr>, `22` <chr>, `23` <chr>,
#   `24` <chr>, `25` <chr>, `26` <chr>, `27` <chr>, `28` <chr>, `29` <chr>,
#   `30` <chr>, `31` <chr>, `32` <chr>, `33` <chr>, `34` <chr>, `35` <chr>,
#   `36` <chr>, `37` <chr>, `38` <chr>, `39` <chr>, `40` <chr>, `41` <chr>,
#   `42` <chr>, `43` <chr>, `44` <chr>, `45` <chr>, `46` <chr>, `47` <chr>, …

There is no one correct way to shape your data. Instead the shape of your data should match the question you want to answer.

You can reshape a dataset into long format using pivot_longer(). You need to pass it a data frame, the columns to reshape, and how to name the new column with values and the new column with names. The latter will use the old variable names as its values. We can go back to the long format of votes:

votes_long <- votes_wide %>%
  pivot_longer(cols = "United States":"South Sudan", names_to = "country", values_to = "vote")

votes_long
# A tibble: 1,240,400 × 3
    rcid country            vote 
   <dbl> <chr>              <chr>
 1     3 United States      yes  
 2     3 Canada             no   
 3     3 Cuba               yes  
 4     3 Haiti              yes  
 5     3 Dominican Republic yes  
 6     3 Mexico             yes  
 7     3 Guatemala          yes  
 8     3 Honduras           yes  
 9     3 El Salvador        yes  
10     3 Nicaragua          yes  
# ℹ 1,240,390 more rows
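Note that votes_long has more rows (1,240,400) than the original votes data frame (869,937): pivot_wider() filled country and roll call combinations with no recorded vote with NA, and pivot_longer() kept those NA rows. You can drop them with the values_drop_na argument; a small sketch on made-up toy data:

```r
library(tidyr)

# a wide toy table with one missing vote
wide <- tibble::tibble(rcid   = c(3, 4),
                       Canada = c("no", "no"),
                       Haiti  = c("yes", NA))

# values_drop_na = TRUE removes rows whose value would be NA
pivot_longer(wide, cols = Canada:Haiti,
             names_to = "country", values_to = "vote",
             values_drop_na = TRUE)
```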

Exercises

Use the datasets on UN roll calls (you can get the data by installing the unvotes package) to:

  • Find out which country voted the most in 1946, regardless of what kind of vote it was.

  • Find out how many amendments each issue had across all roll calls.

  • Find out on which issue there was most agreement (highest percentage of the same votes) between the USA and Poland.