Example EDA

Author

John Little

Modified

February 19, 2024

library(tidyverse)
library(skimr)

Basic steps taken in this rough exploration of data….

A brief discussion of packages which proport to perform EDA for you can be found in the Get Started section of this site.

Note: This page is not intended to teach formal EDA. What happens on this page is merely a brutal re-enactment of some informal explorations that a person might take as they familiarize themselves with new data. If you’re like most people, you might want to skip to the visualizations. Meanwhile, the sections and code-chunks preceding visualization are worth a glance.

Age Distribution of runners.

Age Distribution of runners.

Import data

The data come from a TidyTuesday, a weekly social learning project dedicated to gaining practical experience with R and data science. In this case the TidyTuesday data are based on International Trail Running Association (ITRA) data but inspired by Benjamin Nowak, . We will use the TidyTuesday data that are on GitHub. Nowak’s data are also available on GitHub.

race_df <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-26/race.csv")
rank_df <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-26/ultra_rankings.csv")
glimpse(race_df)
Rows: 1,207
Columns: 13
$ race_year_id   <dbl> 68140, 72496, 69855, 67856, 70469, 66887, 67851, 68241,…
$ event          <chr> "Peak District Ultras", "UTMB®", "Grand Raid des Pyréné…
$ race           <chr> "Millstone 100", "UTMB®", "Ultra Tour 160", "PERSENK UL…
$ city           <chr> "Castleton", "Chamonix", "vielle-Aure", "Asenovgrad", "…
$ country        <chr> "United Kingdom", "France", "France", "Bulgaria", "Turk…
$ date           <date> 2021-09-03, 2021-08-27, 2021-08-20, 2021-08-20, 2021-0…
$ start_time     <time> 19:00:00, 17:00:00, 05:00:00, 18:00:00, 18:00:00, 17:0…
$ participation  <chr> "solo", "Solo", "solo", "solo", "solo", "solo", "solo",…
$ distance       <dbl> 166.9, 170.7, 167.0, 164.0, 159.9, 159.9, 163.8, 163.9,…
$ elevation_gain <dbl> 4520, 9930, 9980, 7490, 100, 9850, 5460, 4630, 6410, 31…
$ elevation_loss <dbl> -4520, -9930, -9980, -7500, -100, -9850, -5460, -4660, …
$ aid_stations   <dbl> 10, 11, 13, 13, 12, 15, 5, 8, 13, 23, 13, 5, 12, 15, 0,…
$ participants   <dbl> 150, 2300, 600, 150, 0, 300, 0, 200, 120, 100, 300, 50,…
glimpse(rank_df)
Rows: 137,803
Columns: 8
$ race_year_id    <dbl> 68140, 68140, 68140, 68140, 68140, 68140, 68140, 68140…
$ rank            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, NA, NA, NA,…
$ runner          <chr> "VERHEUL Jasper", "MOULDING JON", "RICHARDSON Phill", …
$ time            <chr> "26H 35M 25S", "27H 0M 29S", "28H 49M 7S", "30H 53M 37…
$ age             <dbl> 30, 43, 38, 55, 48, 31, 55, 40, 47, 29, 48, 47, 52, 49…
$ gender          <chr> "M", "M", "M", "W", "W", "M", "W", "W", "M", "M", "M",…
$ nationality     <chr> "GBR", "GBR", "GBR", "GBR", "GBR", "GBR", "GBR", "GBR"…
$ time_in_seconds <dbl> 95725, 97229, 103747, 111217, 117981, 118000, 120601, …

EDA with skimr

skim(race_df)
Data summary
Name race_df
Number of rows 1207
Number of columns 13
_______________________
Column type frequency:
character 5
Date 1
difftime 1
numeric 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
event 0 1.00 4 57 0 435 0
race 0 1.00 3 63 0 371 0
city 172 0.86 2 30 0 308 0
country 4 1.00 4 17 0 60 0
participation 0 1.00 4 5 0 4 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2012-01-14 2021-09-03 2017-09-30 711

Variable type: difftime

skim_variable n_missing complete_rate min max median n_unique
start_time 0 1 0 secs 82800 secs 05:00:00 39

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
race_year_id 0 1 27889.65 20689.90 2320 9813.5 23565.0 42686.00 72496.0 ▇▃▃▂▂
distance 0 1 152.62 39.88 0 160.1 161.5 165.15 179.1 ▁▁▁▁▇
elevation_gain 0 1 5294.79 2872.29 0 3210.0 5420.0 7145.00 14430.0 ▅▇▇▂▁
elevation_loss 0 1 -5317.01 2899.12 -14440 -7206.5 -5420.0 -3220.00 0.0 ▁▂▇▇▅
aid_stations 0 1 8.63 7.63 0 0.0 9.0 14.00 56.0 ▇▆▁▁▁
participants 0 1 120.49 281.83 0 0.0 21.0 150.00 2900.0 ▇▁▁▁▁

Read more about automagic EAD packages.

Freewheelin’ EDA

race_df |> 
  count(country, sort = TRUE) |> 
  filter(str_detect(country, regex("Ke", ignore_case = TRUE)))
race_df |> 
  filter(country == "Turkey")
race_df |> 
  count(participation, sort = TRUE)
race_df |> 
  count(participants, sort = TRUE)
skim(rank_df)
Data summary
Name rank_df
Number of rows 137803
Number of columns 8
_______________________
Column type frequency:
character 4
numeric 4
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
runner 0 1.00 3 52 0 73629 0
time 17791 0.87 8 11 0 72840 0
gender 30 1.00 1 1 0 2 0
nationality 0 1.00 3 3 0 133 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
race_year_id 0 1.00 26678.70 20156.18 2320 8670 21795 40621 72496 ▇▃▃▂▂
rank 17791 0.87 253.56 390.80 1 31 87 235 1962 ▇▁▁▁▁
age 0 1.00 46.25 10.11 0 40 46 53 133 ▁▇▂▁▁
time_in_seconds 17791 0.87 122358.26 37234.38 3600 96566 114167 148020 296806 ▁▇▆▁▁
rank_df |> 
  filter(str_detect(nationality, regex("ken", ignore_case = TRUE)))
rank_df |> 
  arrange(rank)