library(tidyverse)Join two data frames
Join
There are a series of join commands
- left_join, inner_join, right_join, full_join,
- semi_join, anti_join

First let’s import the favorability ratings data from fivethirtyeight.com
data
These exercises use the following ggplot2 training datasets
- dplyr::starwars
- Data from fivethrityeight.org (modified)
fav_ratings <- read_csv("data/538_favorability_popularity.csv", skip = 11)
fav_ratingsname <chr> | fav_rating <dbl> | |||
|---|---|---|---|---|
| Han Solo | 610 | |||
| Luke Skywalker | 552 | |||
| Princess Leia Organa | 547 | |||
| Anakin Skywalker | 245 | |||
| Obi Wan Kenobi | 591 | |||
| Emperor Palpatine | 110 | |||
| Darth Vader | 310 | |||
| Lando Calrissian | 142 | |||
| Boba Fett | 138 | |||
| C-3P0 | 474 |
starwarsname <chr> | height <int> | mass <dbl> | hair_color <chr> | |
|---|---|---|---|---|
| Luke Skywalker | 172 | 77.0 | blond | |
| C-3PO | 167 | 75.0 | NA | |
| R2-D2 | 96 | 32.0 | NA | |
| Darth Vader | 202 | 136.0 | none | |
| Leia Organa | 150 | 49.0 | brown | |
| Owen Lars | 178 | 120.0 | brown, grey | |
| Beru Whitesun lars | 165 | 75.0 | brown | |
| R5-D4 | 97 | 32.0 | NA | |
| Biggs Darklighter | 183 | 84.0 | black | |
| Obi-Wan Kenobi | 182 | 77.0 | auburn, white |
Join on a key that is common across two data frames. For best results use a numeric key that promotes precision. Unfortunately our data frames do not have a numeric key to use as our join key. We’ll use an alphabetic key, name, and this will highlight what can go wrong. along with troubleshooting strategies.
The name variable is the key because it is common to both tables: fav_ratings and starwars.
fav_ratings |>
left_join(starwars, by = join_by(name)) |>
arrange(desc(fav_rating))name <chr> | fav_rating <dbl> | height <int> | mass <dbl> | hair_color <chr> | |
|---|---|---|---|---|---|
| Han Solo | 610 | 180 | 80.0 | brown | |
| Yoda | 605 | 66 | 17.0 | white | |
| Obi Wan Kenobi | 591 | NA | NA | NA | |
| R2 D2 | 562 | NA | NA | NA | |
| Luke Skywalker | 552 | 172 | 77.0 | blond | |
| Princess Leia Organa | 547 | NA | NA | NA | |
| C-3P0 | 474 | NA | NA | NA | |
| Darth Vader | 310 | 202 | 136.0 | none | |
| Anakin Skywalker | 245 | 188 | 84.0 | blond | |
| Padme Amidala | 168 | NA | NA | NA |
anti_join
anti_join will determine data values that are in the left table and not the right table.
anti_join(fav_ratings, starwars) |> arrange(name)Joining with `by = join_by(name)`
name <chr> | fav_rating <dbl> | |||
|---|---|---|---|---|
| C-3P0 | 474 | |||
| Emperor Palpatine | 110 | |||
| Obi Wan Kenobi | 591 | |||
| Padme Amidala | 168 | |||
| Princess Leia Organa | 547 | |||
| R2 D2 | 562 |
anti_join(starwars, fav_ratings) |> arrange(name)Joining with `by = join_by(name)`
name <chr> | height <int> | mass <dbl> | hair_color <chr> | |
|---|---|---|---|---|
| Ackbar | 180 | 83.0 | none | |
| Adi Gallia | 184 | 50.0 | none | |
| Arvel Crynyd | NA | NA | brown | |
| Ayla Secura | 178 | 55.0 | none | |
| BB8 | NA | NA | none | |
| Bail Prestor Organa | 191 | NA | black | |
| Barriss Offee | 166 | 50.0 | black | |
| Ben Quadinaros | 163 | 65.0 | none | |
| Beru Whitesun lars | 165 | 75.0 | brown | |
| Bib Fortuna | 180 | NA | none |
semi_join or inner_join
Join only the rows that match.
fav_ratings |>
semi_join(starwars)Joining with `by = join_by(name)`
name <chr> | fav_rating <dbl> | |||
|---|---|---|---|---|
| Han Solo | 610 | |||
| Luke Skywalker | 552 | |||
| Anakin Skywalker | 245 | |||
| Darth Vader | 310 | |||
| Lando Calrissian | 142 | |||
| Boba Fett | 138 | |||
| Jar Jar Binks | 112 | |||
| Yoda | 605 |
Regex
Regular expressions (regex) and {stringr} can be handy when manipulating character variables into join keys. We use regex when troubleshooting what is not matching. In other computatons we might also use regex for more sophisticated text manipulations, specifically for finding patterns in text.
- str_detect()
- str_to_lower()
- str_remove()
- str_trim()
- str_extract()
starwars %>%
filter(str_detect(name,
regex("3p|palpatine|obi|amidala|leia|d2",
ignore_case = TRUE))) %>%
arrange(name)name <chr> | height <int> | mass <dbl> | hair_color <chr> | skin_color <chr> | |
|---|---|---|---|---|---|
| C-3PO | 167 | 75 | NA | gold | |
| Leia Organa | 150 | 49 | brown | light | |
| Obi-Wan Kenobi | 182 | 77 | auburn, white | fair | |
| Padmé Amidala | 165 | 45 | brown | light | |
| Palpatine | 170 | 75 | grey | pale | |
| R2-D2 | 96 | 32 | NA | white, blue |
fav_ratings %>%
filter(str_detect(name,
regex("3p|palpatine|obi|amidala|leia|d2",
ignore_case = TRUE))) %>%
arrange(name)name <chr> | fav_rating <dbl> | |||
|---|---|---|---|---|
| C-3P0 | 474 | |||
| Emperor Palpatine | 110 | |||
| Obi Wan Kenobi | 591 | |||
| Padme Amidala | 168 | |||
| Princess Leia Organa | 547 | |||
| R2 D2 | 562 |
See Also
dplyr::bind_rows()
dplyr::bind_cols()