Session 2 - The Data Jedi Academy: Cleaning Your Data with R and Tidyverse

Catalina Cañizares, Ph.D. and Raymond Balise Ph.D.

2024-06-04

About This Material

This material is freely available under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Some sections are based on content from two other presentations, which are credited at the end of this presentation.

For more information on this license, please visit: Creative Commons License

We will go as fast as the slowest person in the room!

Objectives for Today

  1. Understand the Basics of Data Cleaning: Learn fundamental concepts and techniques for cleaning data using R and tidyverse.
  2. Explore Tidyverse Tools: Familiarize with key Tidyverse packages such as dplyr, and tidyr for data manipulation and cleaning.
  3. Apply Data Cleaning Techniques: Practice applying data cleaning methods to various datasets to prepare them for analysis.

Introduction to the tidyverse

Tidyverse

  • A collection of R 📦s developed by H. Wickham and others at Rstudio

Tidyverse is a collection of R 📦s

  • ggplot2 - visualize stuff

  • dplyr, tidyr - data manipulation

  • purrr - advanced programming

  • readr - import data

  • tibble - improved data.frame format

  • forcats - working w/ factors

  • stringr - working w/ chain of characters

Tidyverse is a collection of R 📦s

Tidyverse

  • A framework for managing data that aims at making the cleaning and preparing steps much easier

  • Main characteristics of a tidy dataset:

    • Each variable is a column
    • Each observation is a raw
    • Each value is in a different cell

When do we clean the data?

Workflow in data science, with Tidyverse

The Pipe operator %>% (and then)

The Pipe operator %>% (and then)

The Pipe operator %>% (and then)

The Pipe operator %>% (and then)

The Pipe operator %>% (and then)

  • This |> is also a pipe

Keyboard shortcut:

  • Mac users: command + shift + m
  • Windows users: ctrl + shift + m

Fun fact

  • The %>%comes from a package called magrittr
  • A clear reference of the famous painting “The Treachery of Images” of the Belgian painter René Magritte.

Difference between base R and tidyverse

  • Star Wars data from the dplyr package

Exploring the Star Wars dataset

starwars
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
skim(starwars)
Data summary
Name starwars
Number of rows 87
Number of columns 14
_______________________
Column type frequency:
character 8
list 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1.00 3 21 0 87 0
hair_color 5 0.94 4 13 0 11 0
skin_color 0 1.00 3 19 0 31 0
eye_color 0 1.00 3 13 0 15 0
sex 4 0.95 4 14 0 4 0
gender 4 0.95 8 9 0 2 0
homeworld 10 0.89 4 14 0 48 0
species 4 0.95 3 14 0 37 0

Variable type: list

skim_variable n_missing complete_rate n_unique min_length max_length
films 0 1 24 1 7
vehicles 0 1 11 0 2
starships 0 1 16 0 5

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
height 6 0.93 174.60 34.77 66 167.0 180 191.0 264 ▂▁▇▅▁
mass 28 0.68 97.31 169.46 15 55.6 79 84.5 1358 ▇▁▁▁▁
birth_year 44 0.49 87.57 154.69 8 35.0 52 72.0 896 ▇▁▁▁▁

Difference between base R and tidyverse

Let’s calculate the BMI for each character in the dataset

Base R 🤢

starwars$bmi <- starwars$mass/(starwars$height/100)^2

tidyverse 😎

starwars_bmi <- 
  starwars %>% 
  mutate(bmi = mass/((height/100)^2)) %>% 
  select(bmi, everything()) # To see the created variable first
# A tibble: 87 × 15
     bmi name      height  mass hair_color skin_color eye_color birth_year sex  
   <dbl> <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1  26.0 Luke Sky…    172    77 blond      fair       blue            19   male 
 2  26.9 C-3PO        167    75 <NA>       gold       yellow         112   none 
 3  34.7 R2-D2         96    32 <NA>       white, bl… red             33   none 
 4  33.3 Darth Va…    202   136 none       white      yellow          41.9 male 
 5  21.8 Leia Org…    150    49 brown      light      brown           19   fema…
 6  37.9 Owen Lars    178   120 brown, gr… light      blue            52   male 
 7  27.5 Beru Whi…    165    75 brown      light      blue            47   fema…
 8  34.0 R5-D4         97    32 <NA>       white, red red             NA   none 
 9  25.1 Biggs Da…    183    84 black      light      brown           24   male 
10  23.2 Obi-Wan …    182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

5 minute break

dplyr

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.

  • We will learn the following functions:
  • arrange

  • filter

  • select

  • rename

  • mutate

    • if_else
    • case_when
  • summarise

  • left_join (and family of joins)

dplyr

Verb syntax

All of the verbs have a very similar syntax:

function(data, argument)

  • But, we work with pipes..
  • Therefore the data comes before the verb, like this:

data %>% function(argument)

arrange()

arrange() changes the ordering of the rows.

starwars_bmi %>% 
  arrange(bmi)
# A tibble: 87 × 15
     bmi name      height  mass hair_color skin_color eye_color birth_year sex  
   <dbl> <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1  12.9 Wat Tamb…    193    48 none       green, gr… unknown           NA male 
 2  13.1 Padmé Am…    185    45 brown      light      brown             46 fema…
 3  14.8 Adi Gall…    184    50 none       dark       blue              NA fema…
 4  15.1 Sly Moore    178    48 none       pale       white             NA <NA> 
 5  16.3 Roos Tar…    224    82 none       grey       orange            NA male 
 6  16.8 Lama Su      229    88 none       grey       black             NA male 
 7  17.2 Jar Jar …    196    66 none       orange     orange            52 male 
 8  17.4 Ayla Sec…    178    55 none       blue       hazel             48 fema…
 9  18.0 Shaak Ti     178    57 none       red, blue… black             NA fema…
10  18.1 Barriss …    166    50 black      yellow     blue              40 fema…
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

arrange()

starwars_bmi %>% 
  arrange(desc(bmi))
# A tibble: 87 × 15
     bmi name      height  mass hair_color skin_color eye_color birth_year sex  
   <dbl> <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 443.  Jabba De…    175  1358 <NA>       green-tan… orange         600   herm…
 2  50.9 Dud Bolt      94    45 none       blue, grey yellow          NA   male 
 3  39.0 Yoda          66    17 white      green      brown          896   male 
 4  37.9 Owen Lars    178   120 brown, gr… light      blue            52   male 
 5  35   IG-88        200   140 none       metal      red             15   none 
 6  34.7 R2-D2         96    32 <NA>       white, bl… red             33   none 
 7  34.1 Grievous     216   159 none       brown, wh… green, y…       NA   male 
 8  34.0 R5-D4         97    32 <NA>       white, red red             NA   none 
 9  34.0 Jek Tono…    180   110 brown      fair       blue            NA   <NA> 
10  33.3 Darth Va…    202   136 none       white      yellow          41.9 male 
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

filter()

filter() picks cases based on their values.

starwars_bmi %>% 
  filter(sex == "female")
# A tibble: 16 × 15
     bmi name      height  mass hair_color skin_color eye_color birth_year sex  
   <dbl> <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1  21.8 Leia Org…    150  49   brown      light      brown             19 fema…
 2  27.5 Beru Whi…    165  75   brown      light      blue              47 fema…
 3  NA   Mon Moth…    150  NA   auburn     fair       blue              48 fema…
 4  13.1 Padmé Am…    185  45   brown      light      brown             46 fema…
 5  NA   Shmi Sky…    163  NA   black      fair       brown             72 fema…
 6  17.4 Ayla Sec…    178  55   none       blue       hazel             48 fema…
 7  14.8 Adi Gall…    184  50   none       dark       blue              NA fema…
 8  19.4 Luminara…    170  56.2 black      yellow     blue              58 fema…
 9  18.1 Barriss …    166  50   black      yellow     blue              40 fema…
10  NA   Dormé        165  NA   brown      light      brown             NA fema…
11  19.5 Zam Wese…    168  55   blonde     fair, gre… yellow            NA fema…
12  NA   Taun We      213  NA   none       grey       black             NA fema…
13  NA   Jocasta …    167  NA   white      fair       blue              NA fema…
14  18.0 Shaak Ti     178  57   none       red, blue… black             NA fema…
15  NA   Rey           NA  NA   brown      light      hazel             NA fema…
16  NA   Captain …     NA  NA   none       none       unknown           NA fema…
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

filter()

starwars_bmi %>% 
  filter(sex != "female")
# A tibble: 67 × 15
     bmi name      height  mass hair_color skin_color eye_color birth_year sex  
   <dbl> <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1  26.0 Luke Sky…    172    77 blond      fair       blue            19   male 
 2  26.9 C-3PO        167    75 <NA>       gold       yellow         112   none 
 3  34.7 R2-D2         96    32 <NA>       white, bl… red             33   none 
 4  33.3 Darth Va…    202   136 none       white      yellow          41.9 male 
 5  37.9 Owen Lars    178   120 brown, gr… light      blue            52   male 
 6  34.0 R5-D4         97    32 <NA>       white, red red             NA   none 
 7  25.1 Biggs Da…    183    84 black      light      brown           24   male 
 8  23.2 Obi-Wan …    182    77 auburn, w… fair       blue-gray       57   male 
 9  23.8 Anakin S…    188    84 blond      fair       blue            41.9 male 
10  NA   Wilhuff …    180    NA auburn, g… fair       blue            64   male 
# ℹ 57 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

filter()

starwars_bmi %>% 
  filter(birth_year >= 112)
# A tibble: 4 × 15
    bmi name       height  mass hair_color skin_color eye_color birth_year sex  
  <dbl> <chr>       <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
1  26.9 C-3PO         167    75 <NA>       gold       yellow           112 none 
2  21.5 Chewbacca     228   112 brown      unknown    blue             200 male 
3 443.  Jabba Des…    175  1358 <NA>       green-tan… orange           600 herm…
4  39.0 Yoda           66    17 white      green      brown            896 male 
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

select()

select() picks variables based on their names.

starwars_bmi %>% 
  select(bmi)
# A tibble: 87 × 1
     bmi
   <dbl>
 1  26.0
 2  26.9
 3  34.7
 4  33.3
 5  21.8
 6  37.9
 7  27.5
 8  34.0
 9  25.1
10  23.2
# ℹ 77 more rows

select()

starwars_bmi %>% 
  select(-bmi)
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

select()

starwars_bmi %>% 
  select(name, bmi, sex) %>% 
  filter(sex == "male") %>% 
  arrange(bmi)
# A tibble: 60 × 3
   name              bmi sex  
   <chr>           <dbl> <chr>
 1 Wat Tambor       12.9 male 
 2 Roos Tarpals     16.3 male 
 3 Lama Su          16.8 male 
 4 Jar Jar Binks    17.2 male 
 5 Tion Medon       18.9 male 
 6 Ki-Adi-Mundi     20.9 male 
 7 Dooku            21.5 male 
 8 Chewbacca        21.5 male 
 9 Raymus Antilles  22.4 male 
10 Plo Koon         22.6 male 
# ℹ 50 more rows

rename()

starwars_bmi %>% 
  select(name, bmi, sex) %>% 
  filter(sex == "male") %>% 
  arrange(bmi) %>% 
  rename(personaje = name, sexo = sex)
# A tibble: 60 × 3
   personaje         bmi sexo 
   <chr>           <dbl> <chr>
 1 Wat Tambor       12.9 male 
 2 Roos Tarpals     16.3 male 
 3 Lama Su          16.8 male 
 4 Jar Jar Binks    17.2 male 
 5 Tion Medon       18.9 male 
 6 Ki-Adi-Mundi     20.9 male 
 7 Dooku            21.5 male 
 8 Chewbacca        21.5 male 
 9 Raymus Antilles  22.4 male 
10 Plo Koon         22.6 male 
# ℹ 50 more rows

rename()

rename also works within a select() function, like this:

starwars_bmi %>% 
  select(personaje = name, sexo = sex, bmi) %>% 
  filter(sexo == "male") %>% 
  arrange(bmi)
# A tibble: 60 × 3
   personaje       sexo    bmi
   <chr>           <chr> <dbl>
 1 Wat Tambor      male   12.9
 2 Roos Tarpals    male   16.3
 3 Lama Su         male   16.8
 4 Jar Jar Binks   male   17.2
 5 Tion Medon      male   18.9
 6 Ki-Adi-Mundi    male   20.9
 7 Dooku           male   21.5
 8 Chewbacca       male   21.5
 9 Raymus Antilles male   22.4
10 Plo Koon        male   22.6
# ℹ 50 more rows

mutate() my personal favorite 💜

mutate() adds new variables that are functions of existing variables

starwars_bmi %>% 
  mutate(id = row_number()) %>% 
  select(id, personaje = name, sexo = sex, bmi) %>% 
  filter(sexo == "male") %>% 
  arrange(bmi) %>% 
  select(-personaje)
# A tibble: 60 × 3
      id sexo    bmi
   <int> <chr> <dbl>
 1    75 male   12.9
 2    36 male   16.3
 3    71 male   16.8
 4    35 male   17.2
 5    82 male   18.9
 6    51 male   20.9
 7    66 male   21.5
 8    13 male   21.5
 9    80 male   22.4
10    57 male   22.6
# ℹ 50 more rows

mutate() and if_else()

starwars_bmi %>% 
  mutate(is_blue_eyes = if_else(eye_color == "blue", 1, 0)) %>% 
  select(name,is_blue_eyes, eye_color) %>% 
  arrange(desc(is_blue_eyes))
# A tibble: 87 × 3
   name               is_blue_eyes eye_color
   <chr>                     <dbl> <chr>    
 1 Luke Skywalker                1 blue     
 2 Owen Lars                     1 blue     
 3 Beru Whitesun Lars            1 blue     
 4 Anakin Skywalker              1 blue     
 5 Wilhuff Tarkin                1 blue     
 6 Chewbacca                     1 blue     
 7 Jek Tono Porkins              1 blue     
 8 Lobot                         1 blue     
 9 Mon Mothma                    1 blue     
10 Qui-Gon Jinn                  1 blue     
# ℹ 77 more rows
starwars_bmi %>% 
  mutate(is_blue_eyes = if_else(eye_color == "blue", "blue", "other")) %>% 
  select(name, is_blue_eyes, eye_color) %>% 
  arrange(is_blue_eyes)
# A tibble: 87 × 3
   name               is_blue_eyes eye_color
   <chr>              <chr>        <chr>    
 1 Luke Skywalker     blue         blue     
 2 Owen Lars          blue         blue     
 3 Beru Whitesun Lars blue         blue     
 4 Anakin Skywalker   blue         blue     
 5 Wilhuff Tarkin     blue         blue     
 6 Chewbacca          blue         blue     
 7 Jek Tono Porkins   blue         blue     
 8 Lobot              blue         blue     
 9 Mon Mothma         blue         blue     
10 Qui-Gon Jinn       blue         blue     
# ℹ 77 more rows

mutate() and case_when()💜

starwars %>%
  mutate(is_human = case_when(
    species == "Human" ~ 1,
    species != "Human" ~ 0, 
    .default = NA  # Handle cases where species might be NA
  )) %>% 
  select(is_human, species)
# A tibble: 87 × 2
   is_human species
      <dbl> <chr>  
 1        1 Human  
 2        0 Droid  
 3        0 Droid  
 4        1 Human  
 5        1 Human  
 6        1 Human  
 7        1 Human  
 8        0 Droid  
 9        1 Human  
10        1 Human  
# ℹ 77 more rows

mutate() and case_when()💜

starwars_bmi %>%
  mutate(bmi_category = case_when(
    bmi < 18.5 ~ "Underweight",
    bmi >= 18.5 & bmi < 25 ~ "Normal weight",
    bmi >= 25 & bmi < 30 ~ "Overweight",
    bmi >= 30 & bmi < 35 ~ "Obesity Class I",
    bmi >= 35 & bmi < 40 ~ "Obesity Class II",
    bmi >= 40 ~ "Obesity Class III",
    .default =  NA_character_  # Handle cases where bmi might be NA
  )) %>% 
  select(name, bmi, bmi_category)
# A tibble: 87 × 3
   name                 bmi bmi_category    
   <chr>              <dbl> <chr>           
 1 Luke Skywalker      26.0 Overweight      
 2 C-3PO               26.9 Overweight      
 3 R2-D2               34.7 Obesity Class I 
 4 Darth Vader         33.3 Obesity Class I 
 5 Leia Organa         21.8 Normal weight   
 6 Owen Lars           37.9 Obesity Class II
 7 Beru Whitesun Lars  27.5 Overweight      
 8 R5-D4               34.0 Obesity Class I 
 9 Biggs Darklighter   25.1 Overweight      
10 Obi-Wan Kenobi      23.2 Normal weight   
# ℹ 77 more rows

summarise I always get it wrong the first time 😒

summarise() reduces multiple values down to a single summary.

starwars_bmi %>% 
  summarise(
    mean_bmi = mean(bmi, na.rm = TRUE)
  )
# A tibble: 1 × 1
  mean_bmi
     <dbl>
1     32.0

summarise

starwars_bmi %>%
  mutate(bmi_category = case_when(
    bmi < 18.5 ~ "Underweight",
    bmi >= 18.5 & bmi < 25 ~ "Normal weight",
    bmi >= 25 & bmi < 30 ~ "Overweight",
    bmi >= 30 & bmi < 35 ~ "Obesity Class I",
    bmi >= 35 & bmi < 40 ~ "Obesity Class II",
    bmi >= 40 ~ "Obesity Class III",
    .default =  NA_character_  # Handle cases where bmi might be NA
  )) %>% 
  filter(!is.na(bmi_category)) %>% # We created an NA ctaegory and I want to rmv
  summarise(
    mean_bmi = mean(bmi, na.rm = TRUE), 
    sd_bmi = sd(bmi, na.rm = TRUE), 
    min = min(bmi, na.rm = TRUE), 
    max = max(bmi, na.rm = TRUE), 
    .by = bmi_category
  )
# A tibble: 6 × 5
  bmi_category      mean_bmi  sd_bmi   min   max
  <chr>                <dbl>   <dbl> <dbl> <dbl>
1 Overweight            26.1   0.676  25.1  27.5
2 Obesity Class I       33.3   1.26   31.3  34.7
3 Normal weight         22.9   1.80   18.9  24.8
4 Obesity Class II      37.3   2.07   35    39.0
5 Obesity Class III    247.  278.     50.9 443. 
6 Underweight           16.0   1.90   12.9  18.1

summarise

  • This is an example of code using the <- gets arrow > Keyboard shortcut:
    • Mac users: option + - (minus key)
    • Windows users: Alt + - (minus key)
  1. I create and store as an object the dataset with the categories
starwars_bmi_cat <- 
  starwars_bmi %>%
  mutate(bmi_category = case_when(
    bmi < 18.5 ~ "Underweight",
    bmi >= 18.5 & bmi < 25 ~ "Normal weight",
    bmi >= 25 & bmi < 30 ~ "Overweight",
    bmi >= 30 & bmi < 35 ~ "Obesity Class I",
    bmi >= 35 & bmi < 40 ~ "Obesity Class II",
    bmi >= 40 ~ "Obesity Class III",
    .default =  NA_character_  # Handle cases where bmi might be NA
  )) %>% 
  filter(!is.na(bmi_category))

summarise

  1. I use the dataset
starwars_bmi_cat %>% 
  summarise(
    mean_bmi = mean(bmi, na.rm = TRUE), 
    sd_bmi = sd(bmi, na.rm = TRUE), 
    min = min(bmi, na.rm = TRUE), 
    max = max(bmi, na.rm = TRUE), 
    .by = c(bmi_category, species)
  ) %>% 
  arrange(bmi_category, mean_bmi) %>% 
  print(n = 41)
# A tibble: 41 × 6
   bmi_category      species        mean_bmi sd_bmi   min   max
   <chr>             <chr>             <dbl>  <dbl> <dbl> <dbl>
 1 Normal weight     Pau'an             18.9 NA      18.9  18.9
 2 Normal weight     Mirialan           19.4 NA      19.4  19.4
 3 Normal weight     Clawdite           19.5 NA      19.5  19.5
 4 Normal weight     Cerean             20.9 NA      20.9  20.9
 5 Normal weight     Kel Dor            22.6 NA      22.6  22.6
 6 Normal weight     Nautolan           22.6 NA      22.6  22.6
 7 Normal weight     Human              23.2  1.01   21.5  24.7
 8 Normal weight     Wookiee            23.2  2.33   21.5  24.8
 9 Normal weight     Geonosian          23.9 NA      23.9  23.9
10 Normal weight     Aleena             24.0 NA      24.0  24.0
11 Normal weight     Toong              24.5 NA      24.5  24.5
12 Normal weight     Neimodian          24.7 NA      24.7  24.7
13 Normal weight     Rodian             24.7 NA      24.7  24.7
14 Normal weight     <NA>               24.8 NA      24.8  24.8
15 Obesity Class I   Trandoshan         31.3 NA      31.3  31.3
16 Obesity Class I   Dug                31.9 NA      31.9  31.9
17 Obesity Class I   Human              33.3 NA      33.3  33.3
18 Obesity Class I   <NA>               34.0 NA      34.0  34.0
19 Obesity Class I   Kaleesh            34.1 NA      34.1  34.1
20 Obesity Class I   Droid              34.4  0.504  34.0  34.7
21 Obesity Class II  Droid              35   NA      35    35  
22 Obesity Class II  Human              37.9 NA      37.9  37.9
23 Obesity Class II  Yoda's species     39.0 NA      39.0  39.0
24 Obesity Class III Vulptereen         50.9 NA      50.9  50.9
25 Obesity Class III Hutt              443.  NA     443.  443. 
26 Overweight        Mon Calamari       25.6 NA      25.6  25.6
27 Overweight        Ewok               25.8 NA      25.8  25.8
28 Overweight        Besalisk           26.0 NA      26.0  26.0
29 Overweight        Human              26.0  0.847  25.1  27.5
30 Overweight        Zabrak             26.1 NA      26.1  26.1
31 Overweight        Sullustan          26.6 NA      26.6  26.6
32 Overweight        Droid              26.9 NA      26.9  26.9
33 Underweight       Skakoan            12.9 NA      12.9  12.9
34 Underweight       Human              13.1 NA      13.1  13.1
35 Underweight       Tholothian         14.8 NA      14.8  14.8
36 Underweight       <NA>               15.1 NA      15.1  15.1
37 Underweight       Gungan             16.8  0.592  16.3  17.2
38 Underweight       Kaminoan           16.8 NA      16.8  16.8
39 Underweight       Twi'lek            17.4 NA      17.4  17.4
40 Underweight       Togruta            18.0 NA      18.0  18.0
41 Underweight       Mirialan           18.1 NA      18.1  18.1

summarise

  • Was it useful?

  • Can we make it better?

summarise

starwars_bmi_cat %>%
  mutate(is_human = case_when(
    species == "Human" ~ 1,
    species != "Human" ~ 0,
    .default = NA # Handle cases where species might be NA
  )) %>%
  filter(!is.na(is_human)) %>% 
  summarise(
    mean_bmi = mean(bmi, na.rm = TRUE),
    sd_bmi = sd(bmi, na.rm = TRUE),
    min = min(bmi, na.rm = TRUE),
    max = max(bmi, na.rm = TRUE),
    .by = c(bmi_category, is_human)
  ) %>% 
  arrange(bmi_category)
# A tibble: 11 × 6
   bmi_category      is_human mean_bmi  sd_bmi   min   max
   <chr>                <dbl>    <dbl>   <dbl> <dbl> <dbl>
 1 Normal weight            1     23.2   1.01   21.5  24.7
 2 Normal weight            0     22.5   2.21   18.9  24.8
 3 Obesity Class I          0     33.2   1.51   31.3  34.7
 4 Obesity Class I          1     33.3  NA      33.3  33.3
 5 Obesity Class II         1     37.9  NA      37.9  37.9
 6 Obesity Class II         0     37.0   2.85   35    39.0
 7 Obesity Class III        0    247.  278.     50.9 443. 
 8 Overweight               1     26.0   0.847  25.1  27.5
 9 Overweight               0     26.2   0.474  25.6  26.9
10 Underweight              1     13.1  NA      13.1  13.1
11 Underweight              0     16.4   1.78   12.9  18.1

The Joins Family

Inner join:

  • An inner_join() only keeps observations from x that have a matching key in y.

Outer joins:

The outer joins keep observations that appear in at least one of the data frames:

  • A left_join() keeps all observations in x.

  • A right_join() keeps all observations in y.

  • A full_join() keeps all observations in x and y.

An unnrealisitc example

I am creating the data sets to assign a penguin to a star wars character that has the same ID

penguins <-
  palmerpenguins::penguins

penguins_math_starwars <- 
  penguins %>%
  mutate(id = seq(2, by = 2, length.out = nrow(penguins))) %>%
  select(id, everything())
starwars_match_penguins <- 
  starwars %>% 
  mutate(id = seq(nrow(starwars))) %>% 
  select(id, everything())

left_join()

starwars_match_penguins %>% 
  select(id, name) %>% 
  left_join(penguins_math_starwars, by = "id") 
# A tibble: 87 × 10
      id name      species island bill_length_mm bill_depth_mm flipper_length_mm
   <dbl> <chr>     <fct>   <fct>           <dbl>         <dbl>             <int>
 1     1 Luke Sky… <NA>    <NA>             NA            NA                  NA
 2     2 C-3PO     Adelie  Torge…           39.1          18.7               181
 3     3 R2-D2     <NA>    <NA>             NA            NA                  NA
 4     4 Darth Va… Adelie  Torge…           39.5          17.4               186
 5     5 Leia Org… <NA>    <NA>             NA            NA                  NA
 6     6 Owen Lars Adelie  Torge…           40.3          18                 195
 7     7 Beru Whi… <NA>    <NA>             NA            NA                  NA
 8     8 R5-D4     Adelie  Torge…           NA            NA                  NA
 9     9 Biggs Da… <NA>    <NA>             NA            NA                  NA
10    10 Obi-Wan … Adelie  Torge…           36.7          19.3               193
# ℹ 77 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>

right_join()

starwars_match_penguins %>% 
  select(id, name) %>% 
  right_join(penguins_math_starwars, by = "id")
# A tibble: 344 × 10
      id name      species island bill_length_mm bill_depth_mm flipper_length_mm
   <dbl> <chr>     <fct>   <fct>           <dbl>         <dbl>             <int>
 1     2 C-3PO     Adelie  Torge…           39.1          18.7               181
 2     4 Darth Va… Adelie  Torge…           39.5          17.4               186
 3     6 Owen Lars Adelie  Torge…           40.3          18                 195
 4     8 R5-D4     Adelie  Torge…           NA            NA                  NA
 5    10 Obi-Wan … Adelie  Torge…           36.7          19.3               193
 6    12 Wilhuff … Adelie  Torge…           39.3          20.6               190
 7    14 Han Solo  Adelie  Torge…           38.9          17.8               181
 8    16 Jabba De… Adelie  Torge…           39.2          19.6               195
 9    18 Jek Tono… Adelie  Torge…           34.1          18.1               193
10    20 Palpatine Adelie  Torge…           42            20.2               190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>

full_join()

starwars_match_penguins %>% 
  select(id, name) %>% 
  full_join(penguins_math_starwars, by = "id")
# A tibble: 388 × 10
      id name      species island bill_length_mm bill_depth_mm flipper_length_mm
   <dbl> <chr>     <fct>   <fct>           <dbl>         <dbl>             <int>
 1     1 Luke Sky… <NA>    <NA>             NA            NA                  NA
 2     2 C-3PO     Adelie  Torge…           39.1          18.7               181
 3     3 R2-D2     <NA>    <NA>             NA            NA                  NA
 4     4 Darth Va… Adelie  Torge…           39.5          17.4               186
 5     5 Leia Org… <NA>    <NA>             NA            NA                  NA
 6     6 Owen Lars Adelie  Torge…           40.3          18                 195
 7     7 Beru Whi… <NA>    <NA>             NA            NA                  NA
 8     8 R5-D4     Adelie  Torge…           NA            NA                  NA
 9     9 Biggs Da… <NA>    <NA>             NA            NA                  NA
10    10 Obi-Wan … Adelie  Torge…           36.7          19.3               193
# ℹ 378 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>

inner_join()

starwars_match_penguins %>% 
  select(id, name) %>% 
  inner_join(penguins_math_starwars, by = "id")
# A tibble: 43 × 10
      id name      species island bill_length_mm bill_depth_mm flipper_length_mm
   <dbl> <chr>     <fct>   <fct>           <dbl>         <dbl>             <int>
 1     2 C-3PO     Adelie  Torge…           39.1          18.7               181
 2     4 Darth Va… Adelie  Torge…           39.5          17.4               186
 3     6 Owen Lars Adelie  Torge…           40.3          18                 195
 4     8 R5-D4     Adelie  Torge…           NA            NA                  NA
 5    10 Obi-Wan … Adelie  Torge…           36.7          19.3               193
 6    12 Wilhuff … Adelie  Torge…           39.3          20.6               190
 7    14 Han Solo  Adelie  Torge…           38.9          17.8               181
 8    16 Jabba De… Adelie  Torge…           39.2          19.6               195
 9    18 Jek Tono… Adelie  Torge…           34.1          18.1               193
10    20 Palpatine Adelie  Torge…           42            20.2               190
# ℹ 33 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>

And a bonus: anti_join()

starwars_match_penguins %>% 
  select(id, name) %>% 
  anti_join(penguins_math_starwars, by = "id")
# A tibble: 44 × 2
      id name              
   <int> <chr>             
 1     1 Luke Skywalker    
 2     3 R2-D2             
 3     5 Leia Organa       
 4     7 Beru Whitesun Lars
 5     9 Biggs Darklighter 
 6    11 Anakin Skywalker  
 7    13 Chewbacca         
 8    15 Greedo            
 9    17 Wedge Antilles    
10    19 Yoda              
# ℹ 34 more rows

Another break?

Briefly…

tidyr

  • This package has 2 very very useful functions you must learn.

  • Have you seen this before? (A show of hands)

Long data vs. Wide data

Wide format

Each row represents a single character

# A tibble: 87 × 4
   name               height  mass hair_color   
   <chr>               <int> <dbl> <chr>        
 1 Luke Skywalker        172    77 blond        
 2 C-3PO                 167    75 <NA>         
 3 R2-D2                  96    32 <NA>         
 4 Darth Vader           202   136 none         
 5 Leia Organa           150    49 brown        
 6 Owen Lars             178   120 brown, grey  
 7 Beru Whitesun Lars    165    75 brown        
 8 R5-D4                  97    32 <NA>         
 9 Biggs Darklighter     183    84 black        
10 Obi-Wan Kenobi        182    77 auburn, white
# ℹ 77 more rows

Long format

Each row would represent a single attribute of a character.

# A tibble: 261 × 3
   name           attribute  value
   <chr>          <chr>      <chr>
 1 Luke Skywalker height     172  
 2 Luke Skywalker mass       77   
 3 Luke Skywalker hair_color blond
 4 C-3PO          height     167  
 5 C-3PO          mass       75   
 6 C-3PO          hair_color <NA> 
 7 R2-D2          height     96   
 8 R2-D2          mass       32   
 9 R2-D2          hair_color <NA> 
10 Darth Vader    height     202  
# ℹ 251 more rows

pivot_longer() and pivot_wider()

  • pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider()

  • pivot_wider() “widens” data, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer().

pivot_longer()

Remember to look at the documentation here

starwars_wide <- 
  starwars %>%
  mutate(across(-name, as.character)) %>% 
  select(name, height, mass, hair_color) %>% 
  pivot_longer(cols = -name, # Keep the 'name' column fixed
               names_to = "attribute", # Name for the new attribute column
               values_to = "value") # Name for the new value column

starwars_wide
# A tibble: 261 × 3
   name           attribute  value
   <chr>          <chr>      <chr>
 1 Luke Skywalker height     172  
 2 Luke Skywalker mass       77   
 3 Luke Skywalker hair_color blond
 4 C-3PO          height     167  
 5 C-3PO          mass       75   
 6 C-3PO          hair_color <NA> 
 7 R2-D2          height     96   
 8 R2-D2          mass       32   
 9 R2-D2          hair_color <NA> 
10 Darth Vader    height     202  
# ℹ 251 more rows

pivot_wider()

Also, look at the documentation here

starwars_wide %>%
  pivot_wider(names_from = attribute, values_from = value)
# A tibble: 87 × 4
   name               height mass  hair_color   
   <chr>              <chr>  <chr> <chr>        
 1 Luke Skywalker     172    77    blond        
 2 C-3PO              167    75    <NA>         
 3 R2-D2              96     32    <NA>         
 4 Darth Vader        202    136   none         
 5 Leia Organa        150    49    brown        
 6 Owen Lars          178    120   brown, grey  
 7 Beru Whitesun Lars 165    75    brown        
 8 R5-D4              97     32    <NA>         
 9 Biggs Darklighter  183    84    black        
10 Obi-Wan Kenobi     182    77    auburn, white
# ℹ 77 more rows

We are ready for ggplot!!

Credit

I used material from: Oliver Gimenez and L. Paloma Rojas.

Thank you!