Join two data frames

Modified

February 19, 2024

library(tidyverse)

Join

There are a series of join commands

  • left_join, inner_join, right_join, full_join,
  • semi_join, anti_join

{dplyr} joins visualized in venn diagrams

First let’s import the favorability ratings data from fivethirtyeight.com

data

These exercises use the following ggplot2 training datasets

  • dplyr::starwars
  • Data from fivethrityeight.org (modified)
fav_ratings <- read_csv("data/538_favorability_popularity.csv", skip = 11)
fav_ratings
ABCDEFGHIJ0123456789
name
<chr>
fav_rating
<dbl>
Han Solo610
Luke Skywalker552
Princess Leia Organa547
Anakin Skywalker245
Obi Wan Kenobi591
Emperor Palpatine110
Darth Vader310
Lando Calrissian142
Boba Fett138
C-3P0474
starwars
ABCDEFGHIJ0123456789
name
<chr>
height
<int>
mass
<dbl>
hair_color
<chr>
Luke Skywalker17277.0blond
C-3PO16775.0NA
R2-D29632.0NA
Darth Vader202136.0none
Leia Organa15049.0brown
Owen Lars178120.0brown, grey
Beru Whitesun lars16575.0brown
R5-D49732.0NA
Biggs Darklighter18384.0black
Obi-Wan Kenobi18277.0auburn, white

Join on a key that is common across two data frames. For best results use a numeric key that promotes precision. Unfortunately our data frames do not have a numeric key to use as our join key. We’ll use an alphabetic key, name, and this will highlight what can go wrong. along with troubleshooting strategies.

The name variable is the key because it is common to both tables: fav_ratings and starwars.

fav_ratings  |>  
  left_join(starwars, by = join_by(name))  |> 
  arrange(desc(fav_rating))
ABCDEFGHIJ0123456789
name
<chr>
fav_rating
<dbl>
height
<int>
mass
<dbl>
hair_color
<chr>
Han Solo61018080.0brown
Yoda6056617.0white
Obi Wan Kenobi591NANANA
R2 D2562NANANA
Luke Skywalker55217277.0blond
Princess Leia Organa547NANANA
C-3P0474NANANA
Darth Vader310202136.0none
Anakin Skywalker24518884.0blond
Padme Amidala168NANANA

anti_join

anti_join will determine data values that are in the left table and not the right table.

anti_join(fav_ratings, starwars) |> arrange(name)
Joining with `by = join_by(name)`
ABCDEFGHIJ0123456789
name
<chr>
fav_rating
<dbl>
C-3P0474
Emperor Palpatine110
Obi Wan Kenobi591
Padme Amidala168
Princess Leia Organa547
R2 D2562
anti_join(starwars, fav_ratings) |> arrange(name)
Joining with `by = join_by(name)`
ABCDEFGHIJ0123456789
name
<chr>
height
<int>
mass
<dbl>
hair_color
<chr>
Ackbar18083.0none
Adi Gallia18450.0none
Arvel CrynydNANAbrown
Ayla Secura17855.0none
BB8NANAnone
Bail Prestor Organa191NAblack
Barriss Offee16650.0black
Ben Quadinaros16365.0none
Beru Whitesun lars16575.0brown
Bib Fortuna180NAnone

semi_join or inner_join

Join only the rows that match.

fav_ratings |> 
  semi_join(starwars)
Joining with `by = join_by(name)`
ABCDEFGHIJ0123456789
name
<chr>
fav_rating
<dbl>
Han Solo610
Luke Skywalker552
Anakin Skywalker245
Darth Vader310
Lando Calrissian142
Boba Fett138
Jar Jar Binks112
Yoda605

Regex

Regular expressions (regex) and {stringr} can be handy when manipulating character variables into join keys. We use regex when troubleshooting what is not matching. In other computatons we might also use regex for more sophisticated text manipulations, specifically for finding patterns in text.

  • str_detect()
  • str_to_lower()
  • str_remove()
  • str_trim()
  • str_extract()
starwars %>% 
  filter(str_detect(name, 
                    regex("3p|palpatine|obi|amidala|leia|d2", 
                          ignore_case = TRUE))) %>% 
  arrange(name)
ABCDEFGHIJ0123456789
name
<chr>
height
<int>
mass
<dbl>
hair_color
<chr>
skin_color
<chr>
C-3PO16775NAgold
Leia Organa15049brownlight
Obi-Wan Kenobi18277auburn, whitefair
Padmé Amidala16545brownlight
Palpatine17075greypale
R2-D29632NAwhite, blue
fav_ratings %>% 
  filter(str_detect(name, 
                    regex("3p|palpatine|obi|amidala|leia|d2", 
                          ignore_case = TRUE))) %>% 
  arrange(name)
ABCDEFGHIJ0123456789
name
<chr>
fav_rating
<dbl>
C-3P0474
Emperor Palpatine110
Obi Wan Kenobi591
Padme Amidala168
Princess Leia Organa547
R2 D2562

See Also

  • dplyr::bind_rows()

  • dplyr::bind_cols()