2024-06-04
This material is freely available under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. Some sections are based on content from two other presentations, which are credited at the end of this presentation.
For more information on this license, please visit: Creative Commons License
tidyverse
ggplot2
- visualize stuff
dplyr
, tidyr
- data manipulation
purrr
- advanced programming
readr
- import data
tibble
- improved data.frame format
forcats
- working w/ factors
stringr
- working w/ chain of characters
purrr
- advanced programming (functions and “loops”)
A framework for managing data that aims at making the cleaning and preparing steps much easier
Main characteristics of a tidy dataset:
|>
is also a pipeKeyboard shortcut:
%>%
comes from a package called magrittr
tidyverse
dplyr
package# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
Name | starwars |
Number of rows | 87 |
Number of columns | 14 |
_______________________ | |
Column type frequency: | |
character | 8 |
list | 3 |
numeric | 3 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
name | 0 | 1.00 | 3 | 21 | 0 | 87 | 0 |
hair_color | 5 | 0.94 | 4 | 13 | 0 | 11 | 0 |
skin_color | 0 | 1.00 | 3 | 19 | 0 | 31 | 0 |
eye_color | 0 | 1.00 | 3 | 13 | 0 | 15 | 0 |
sex | 4 | 0.95 | 4 | 14 | 0 | 4 | 0 |
gender | 4 | 0.95 | 8 | 9 | 0 | 2 | 0 |
homeworld | 10 | 0.89 | 4 | 14 | 0 | 48 | 0 |
species | 4 | 0.95 | 3 | 14 | 0 | 37 | 0 |
Variable type: list
skim_variable | n_missing | complete_rate | n_unique | min_length | max_length |
---|---|---|---|---|---|
films | 0 | 1 | 24 | 1 | 7 |
vehicles | 0 | 1 | 11 | 0 | 2 |
starships | 0 | 1 | 16 | 0 | 5 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
height | 6 | 0.93 | 174.60 | 34.77 | 66 | 167.0 | 180 | 191.0 | 264 | ▂▁▇▅▁ |
mass | 28 | 0.68 | 97.31 | 169.46 | 15 | 55.6 | 79 | 84.5 | 1358 | ▇▁▁▁▁ |
birth_year | 44 | 0.49 | 87.57 | 154.69 | 8 | 35.0 | 52 | 72.0 | 896 | ▇▁▁▁▁ |
tidyverse
Let’s calculate the BMI for each character in the dataset
# A tibble: 87 × 15
bmi name height mass hair_color skin_color eye_color birth_year sex
<dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 26.0 Luke Sky… 172 77 blond fair blue 19 male
2 26.9 C-3PO 167 75 <NA> gold yellow 112 none
3 34.7 R2-D2 96 32 <NA> white, bl… red 33 none
4 33.3 Darth Va… 202 136 none white yellow 41.9 male
5 21.8 Leia Org… 150 49 brown light brown 19 fema…
6 37.9 Owen Lars 178 120 brown, gr… light blue 52 male
7 27.5 Beru Whi… 165 75 brown light blue 47 fema…
8 34.0 R5-D4 97 32 <NA> white, red red NA none
9 25.1 Biggs Da… 183 84 black light brown 24 male
10 23.2 Obi-Wan … 182 77 auburn, w… fair blue-gray 57 male
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
dplyr
is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges.
arrange
filter
select
rename
mutate
if_else
case_when
summarise
left_join
(and family of joins)
All of the verbs have a very similar syntax:
function(data, argument)
data %>% function(argument)
arrange()
arrange() changes the ordering of the rows.
# A tibble: 87 × 15
bmi name height mass hair_color skin_color eye_color birth_year sex
<dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 12.9 Wat Tamb… 193 48 none green, gr… unknown NA male
2 13.1 Padmé Am… 185 45 brown light brown 46 fema…
3 14.8 Adi Gall… 184 50 none dark blue NA fema…
4 15.1 Sly Moore 178 48 none pale white NA <NA>
5 16.3 Roos Tar… 224 82 none grey orange NA male
6 16.8 Lama Su 229 88 none grey black NA male
7 17.2 Jar Jar … 196 66 none orange orange 52 male
8 17.4 Ayla Sec… 178 55 none blue hazel 48 fema…
9 18.0 Shaak Ti 178 57 none red, blue… black NA fema…
10 18.1 Barriss … 166 50 black yellow blue 40 fema…
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
arrange()
# A tibble: 87 × 15
bmi name height mass hair_color skin_color eye_color birth_year sex
<dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 443. Jabba De… 175 1358 <NA> green-tan… orange 600 herm…
2 50.9 Dud Bolt 94 45 none blue, grey yellow NA male
3 39.0 Yoda 66 17 white green brown 896 male
4 37.9 Owen Lars 178 120 brown, gr… light blue 52 male
5 35 IG-88 200 140 none metal red 15 none
6 34.7 R2-D2 96 32 <NA> white, bl… red 33 none
7 34.1 Grievous 216 159 none brown, wh… green, y… NA male
8 34.0 R5-D4 97 32 <NA> white, red red NA none
9 34.0 Jek Tono… 180 110 brown fair blue NA <NA>
10 33.3 Darth Va… 202 136 none white yellow 41.9 male
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
filter()
filter() picks cases based on their values.
# A tibble: 16 × 15
bmi name height mass hair_color skin_color eye_color birth_year sex
<dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 21.8 Leia Org… 150 49 brown light brown 19 fema…
2 27.5 Beru Whi… 165 75 brown light blue 47 fema…
3 NA Mon Moth… 150 NA auburn fair blue 48 fema…
4 13.1 Padmé Am… 185 45 brown light brown 46 fema…
5 NA Shmi Sky… 163 NA black fair brown 72 fema…
6 17.4 Ayla Sec… 178 55 none blue hazel 48 fema…
7 14.8 Adi Gall… 184 50 none dark blue NA fema…
8 19.4 Luminara… 170 56.2 black yellow blue 58 fema…
9 18.1 Barriss … 166 50 black yellow blue 40 fema…
10 NA Dormé 165 NA brown light brown NA fema…
11 19.5 Zam Wese… 168 55 blonde fair, gre… yellow NA fema…
12 NA Taun We 213 NA none grey black NA fema…
13 NA Jocasta … 167 NA white fair blue NA fema…
14 18.0 Shaak Ti 178 57 none red, blue… black NA fema…
15 NA Rey NA NA brown light hazel NA fema…
16 NA Captain … NA NA none none unknown NA fema…
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
filter()
# A tibble: 67 × 15
bmi name height mass hair_color skin_color eye_color birth_year sex
<dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 26.0 Luke Sky… 172 77 blond fair blue 19 male
2 26.9 C-3PO 167 75 <NA> gold yellow 112 none
3 34.7 R2-D2 96 32 <NA> white, bl… red 33 none
4 33.3 Darth Va… 202 136 none white yellow 41.9 male
5 37.9 Owen Lars 178 120 brown, gr… light blue 52 male
6 34.0 R5-D4 97 32 <NA> white, red red NA none
7 25.1 Biggs Da… 183 84 black light brown 24 male
8 23.2 Obi-Wan … 182 77 auburn, w… fair blue-gray 57 male
9 23.8 Anakin S… 188 84 blond fair blue 41.9 male
10 NA Wilhuff … 180 NA auburn, g… fair blue 64 male
# ℹ 57 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
filter()
# A tibble: 4 × 15
bmi name height mass hair_color skin_color eye_color birth_year sex
<dbl> <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
1 26.9 C-3PO 167 75 <NA> gold yellow 112 none
2 21.5 Chewbacca 228 112 brown unknown blue 200 male
3 443. Jabba Des… 175 1358 <NA> green-tan… orange 600 herm…
4 39.0 Yoda 66 17 white green brown 896 male
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
# films <list>, vehicles <list>, starships <list>
select()
select() picks variables based on their names.
# A tibble: 87 × 1
bmi
<dbl>
1 26.0
2 26.9
3 34.7
4 33.3
5 21.8
6 37.9
7 27.5
8 34.0
9 25.1
10 23.2
# ℹ 77 more rows
select()
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
select()
# A tibble: 60 × 3
name bmi sex
<chr> <dbl> <chr>
1 Wat Tambor 12.9 male
2 Roos Tarpals 16.3 male
3 Lama Su 16.8 male
4 Jar Jar Binks 17.2 male
5 Tion Medon 18.9 male
6 Ki-Adi-Mundi 20.9 male
7 Dooku 21.5 male
8 Chewbacca 21.5 male
9 Raymus Antilles 22.4 male
10 Plo Koon 22.6 male
# ℹ 50 more rows
rename()
# A tibble: 60 × 3
personaje bmi sexo
<chr> <dbl> <chr>
1 Wat Tambor 12.9 male
2 Roos Tarpals 16.3 male
3 Lama Su 16.8 male
4 Jar Jar Binks 17.2 male
5 Tion Medon 18.9 male
6 Ki-Adi-Mundi 20.9 male
7 Dooku 21.5 male
8 Chewbacca 21.5 male
9 Raymus Antilles 22.4 male
10 Plo Koon 22.6 male
# ℹ 50 more rows
rename()
rename also works within a select()
function, like this:
# A tibble: 60 × 3
personaje sexo bmi
<chr> <chr> <dbl>
1 Wat Tambor male 12.9
2 Roos Tarpals male 16.3
3 Lama Su male 16.8
4 Jar Jar Binks male 17.2
5 Tion Medon male 18.9
6 Ki-Adi-Mundi male 20.9
7 Dooku male 21.5
8 Chewbacca male 21.5
9 Raymus Antilles male 22.4
10 Plo Koon male 22.6
# ℹ 50 more rows
mutate()
my personal favorite 💜mutate() adds new variables that are functions of existing variables
# A tibble: 60 × 3
id sexo bmi
<int> <chr> <dbl>
1 75 male 12.9
2 36 male 16.3
3 71 male 16.8
4 35 male 17.2
5 82 male 18.9
6 51 male 20.9
7 66 male 21.5
8 13 male 21.5
9 80 male 22.4
10 57 male 22.6
# ℹ 50 more rows
mutate()
and if_else()
# A tibble: 87 × 3
name is_blue_eyes eye_color
<chr> <dbl> <chr>
1 Luke Skywalker 1 blue
2 Owen Lars 1 blue
3 Beru Whitesun Lars 1 blue
4 Anakin Skywalker 1 blue
5 Wilhuff Tarkin 1 blue
6 Chewbacca 1 blue
7 Jek Tono Porkins 1 blue
8 Lobot 1 blue
9 Mon Mothma 1 blue
10 Qui-Gon Jinn 1 blue
# ℹ 77 more rows
# A tibble: 87 × 3
name is_blue_eyes eye_color
<chr> <chr> <chr>
1 Luke Skywalker blue blue
2 Owen Lars blue blue
3 Beru Whitesun Lars blue blue
4 Anakin Skywalker blue blue
5 Wilhuff Tarkin blue blue
6 Chewbacca blue blue
7 Jek Tono Porkins blue blue
8 Lobot blue blue
9 Mon Mothma blue blue
10 Qui-Gon Jinn blue blue
# ℹ 77 more rows
mutate()
and case_when()
💜# A tibble: 87 × 2
is_human species
<dbl> <chr>
1 1 Human
2 0 Droid
3 0 Droid
4 1 Human
5 1 Human
6 1 Human
7 1 Human
8 0 Droid
9 1 Human
10 1 Human
# ℹ 77 more rows
mutate()
and case_when()
💜starwars_bmi %>%
mutate(bmi_category = case_when(
bmi < 18.5 ~ "Underweight",
bmi >= 18.5 & bmi < 25 ~ "Normal weight",
bmi >= 25 & bmi < 30 ~ "Overweight",
bmi >= 30 & bmi < 35 ~ "Obesity Class I",
bmi >= 35 & bmi < 40 ~ "Obesity Class II",
bmi >= 40 ~ "Obesity Class III",
.default = NA_character_ # Handle cases where bmi might be NA
)) %>%
select(name, bmi, bmi_category)
# A tibble: 87 × 3
name bmi bmi_category
<chr> <dbl> <chr>
1 Luke Skywalker 26.0 Overweight
2 C-3PO 26.9 Overweight
3 R2-D2 34.7 Obesity Class I
4 Darth Vader 33.3 Obesity Class I
5 Leia Organa 21.8 Normal weight
6 Owen Lars 37.9 Obesity Class II
7 Beru Whitesun Lars 27.5 Overweight
8 R5-D4 34.0 Obesity Class I
9 Biggs Darklighter 25.1 Overweight
10 Obi-Wan Kenobi 23.2 Normal weight
# ℹ 77 more rows
summarise
I always get it wrong the first time 😒summarise() reduces multiple values down to a single summary.
# A tibble: 1 × 1
mean_bmi
<dbl>
1 32.0
summarise
starwars_bmi %>%
mutate(bmi_category = case_when(
bmi < 18.5 ~ "Underweight",
bmi >= 18.5 & bmi < 25 ~ "Normal weight",
bmi >= 25 & bmi < 30 ~ "Overweight",
bmi >= 30 & bmi < 35 ~ "Obesity Class I",
bmi >= 35 & bmi < 40 ~ "Obesity Class II",
bmi >= 40 ~ "Obesity Class III",
.default = NA_character_ # Handle cases where bmi might be NA
)) %>%
filter(!is.na(bmi_category)) %>% # We created an NA ctaegory and I want to rmv
summarise(
mean_bmi = mean(bmi, na.rm = TRUE),
sd_bmi = sd(bmi, na.rm = TRUE),
min = min(bmi, na.rm = TRUE),
max = max(bmi, na.rm = TRUE),
.by = bmi_category
)
# A tibble: 6 × 5
bmi_category mean_bmi sd_bmi min max
<chr> <dbl> <dbl> <dbl> <dbl>
1 Overweight 26.1 0.676 25.1 27.5
2 Obesity Class I 33.3 1.26 31.3 34.7
3 Normal weight 22.9 1.80 18.9 24.8
4 Obesity Class II 37.3 2.07 35 39.0
5 Obesity Class III 247. 278. 50.9 443.
6 Underweight 16.0 1.90 12.9 18.1
summarise
gets
arrow > Keyboard shortcut:
starwars_bmi_cat <-
starwars_bmi %>%
mutate(bmi_category = case_when(
bmi < 18.5 ~ "Underweight",
bmi >= 18.5 & bmi < 25 ~ "Normal weight",
bmi >= 25 & bmi < 30 ~ "Overweight",
bmi >= 30 & bmi < 35 ~ "Obesity Class I",
bmi >= 35 & bmi < 40 ~ "Obesity Class II",
bmi >= 40 ~ "Obesity Class III",
.default = NA_character_ # Handle cases where bmi might be NA
)) %>%
filter(!is.na(bmi_category))
summarise
# A tibble: 41 × 6
bmi_category species mean_bmi sd_bmi min max
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Normal weight Pau'an 18.9 NA 18.9 18.9
2 Normal weight Mirialan 19.4 NA 19.4 19.4
3 Normal weight Clawdite 19.5 NA 19.5 19.5
4 Normal weight Cerean 20.9 NA 20.9 20.9
5 Normal weight Kel Dor 22.6 NA 22.6 22.6
6 Normal weight Nautolan 22.6 NA 22.6 22.6
7 Normal weight Human 23.2 1.01 21.5 24.7
8 Normal weight Wookiee 23.2 2.33 21.5 24.8
9 Normal weight Geonosian 23.9 NA 23.9 23.9
10 Normal weight Aleena 24.0 NA 24.0 24.0
11 Normal weight Toong 24.5 NA 24.5 24.5
12 Normal weight Neimodian 24.7 NA 24.7 24.7
13 Normal weight Rodian 24.7 NA 24.7 24.7
14 Normal weight <NA> 24.8 NA 24.8 24.8
15 Obesity Class I Trandoshan 31.3 NA 31.3 31.3
16 Obesity Class I Dug 31.9 NA 31.9 31.9
17 Obesity Class I Human 33.3 NA 33.3 33.3
18 Obesity Class I <NA> 34.0 NA 34.0 34.0
19 Obesity Class I Kaleesh 34.1 NA 34.1 34.1
20 Obesity Class I Droid 34.4 0.504 34.0 34.7
21 Obesity Class II Droid 35 NA 35 35
22 Obesity Class II Human 37.9 NA 37.9 37.9
23 Obesity Class II Yoda's species 39.0 NA 39.0 39.0
24 Obesity Class III Vulptereen 50.9 NA 50.9 50.9
25 Obesity Class III Hutt 443. NA 443. 443.
26 Overweight Mon Calamari 25.6 NA 25.6 25.6
27 Overweight Ewok 25.8 NA 25.8 25.8
28 Overweight Besalisk 26.0 NA 26.0 26.0
29 Overweight Human 26.0 0.847 25.1 27.5
30 Overweight Zabrak 26.1 NA 26.1 26.1
31 Overweight Sullustan 26.6 NA 26.6 26.6
32 Overweight Droid 26.9 NA 26.9 26.9
33 Underweight Skakoan 12.9 NA 12.9 12.9
34 Underweight Human 13.1 NA 13.1 13.1
35 Underweight Tholothian 14.8 NA 14.8 14.8
36 Underweight <NA> 15.1 NA 15.1 15.1
37 Underweight Gungan 16.8 0.592 16.3 17.2
38 Underweight Kaminoan 16.8 NA 16.8 16.8
39 Underweight Twi'lek 17.4 NA 17.4 17.4
40 Underweight Togruta 18.0 NA 18.0 18.0
41 Underweight Mirialan 18.1 NA 18.1 18.1
summarise
summarise
starwars_bmi_cat %>%
mutate(is_human = case_when(
species == "Human" ~ 1,
species != "Human" ~ 0,
.default = NA # Handle cases where species might be NA
)) %>%
filter(!is.na(is_human)) %>%
summarise(
mean_bmi = mean(bmi, na.rm = TRUE),
sd_bmi = sd(bmi, na.rm = TRUE),
min = min(bmi, na.rm = TRUE),
max = max(bmi, na.rm = TRUE),
.by = c(bmi_category, is_human)
) %>%
arrange(bmi_category)
# A tibble: 11 × 6
bmi_category is_human mean_bmi sd_bmi min max
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Normal weight 1 23.2 1.01 21.5 24.7
2 Normal weight 0 22.5 2.21 18.9 24.8
3 Obesity Class I 0 33.2 1.51 31.3 34.7
4 Obesity Class I 1 33.3 NA 33.3 33.3
5 Obesity Class II 1 37.9 NA 37.9 37.9
6 Obesity Class II 0 37.0 2.85 35 39.0
7 Obesity Class III 0 247. 278. 50.9 443.
8 Overweight 1 26.0 0.847 25.1 27.5
9 Overweight 0 26.2 0.474 25.6 26.9
10 Underweight 1 13.1 NA 13.1 13.1
11 Underweight 0 16.4 1.78 12.9 18.1
Joins
FamilyThe outer joins keep observations that appear in at least one of the data frames:
A left_join() keeps all observations in x.
A right_join() keeps all observations in y.
A full_join() keeps all observations in x and y.
I am creating the data sets to assign a penguin to a star wars character that has the same ID
left_join()
# A tibble: 87 × 10
id name species island bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <chr> <fct> <fct> <dbl> <dbl> <int>
1 1 Luke Sky… <NA> <NA> NA NA NA
2 2 C-3PO Adelie Torge… 39.1 18.7 181
3 3 R2-D2 <NA> <NA> NA NA NA
4 4 Darth Va… Adelie Torge… 39.5 17.4 186
5 5 Leia Org… <NA> <NA> NA NA NA
6 6 Owen Lars Adelie Torge… 40.3 18 195
7 7 Beru Whi… <NA> <NA> NA NA NA
8 8 R5-D4 Adelie Torge… NA NA NA
9 9 Biggs Da… <NA> <NA> NA NA NA
10 10 Obi-Wan … Adelie Torge… 36.7 19.3 193
# ℹ 77 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>
right_join()
# A tibble: 344 × 10
id name species island bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <chr> <fct> <fct> <dbl> <dbl> <int>
1 2 C-3PO Adelie Torge… 39.1 18.7 181
2 4 Darth Va… Adelie Torge… 39.5 17.4 186
3 6 Owen Lars Adelie Torge… 40.3 18 195
4 8 R5-D4 Adelie Torge… NA NA NA
5 10 Obi-Wan … Adelie Torge… 36.7 19.3 193
6 12 Wilhuff … Adelie Torge… 39.3 20.6 190
7 14 Han Solo Adelie Torge… 38.9 17.8 181
8 16 Jabba De… Adelie Torge… 39.2 19.6 195
9 18 Jek Tono… Adelie Torge… 34.1 18.1 193
10 20 Palpatine Adelie Torge… 42 20.2 190
# ℹ 334 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>
full_join()
# A tibble: 388 × 10
id name species island bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <chr> <fct> <fct> <dbl> <dbl> <int>
1 1 Luke Sky… <NA> <NA> NA NA NA
2 2 C-3PO Adelie Torge… 39.1 18.7 181
3 3 R2-D2 <NA> <NA> NA NA NA
4 4 Darth Va… Adelie Torge… 39.5 17.4 186
5 5 Leia Org… <NA> <NA> NA NA NA
6 6 Owen Lars Adelie Torge… 40.3 18 195
7 7 Beru Whi… <NA> <NA> NA NA NA
8 8 R5-D4 Adelie Torge… NA NA NA
9 9 Biggs Da… <NA> <NA> NA NA NA
10 10 Obi-Wan … Adelie Torge… 36.7 19.3 193
# ℹ 378 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>
inner_join()
# A tibble: 43 × 10
id name species island bill_length_mm bill_depth_mm flipper_length_mm
<dbl> <chr> <fct> <fct> <dbl> <dbl> <int>
1 2 C-3PO Adelie Torge… 39.1 18.7 181
2 4 Darth Va… Adelie Torge… 39.5 17.4 186
3 6 Owen Lars Adelie Torge… 40.3 18 195
4 8 R5-D4 Adelie Torge… NA NA NA
5 10 Obi-Wan … Adelie Torge… 36.7 19.3 193
6 12 Wilhuff … Adelie Torge… 39.3 20.6 190
7 14 Han Solo Adelie Torge… 38.9 17.8 181
8 16 Jabba De… Adelie Torge… 39.2 19.6 195
9 18 Jek Tono… Adelie Torge… 34.1 18.1 193
10 20 Palpatine Adelie Torge… 42 20.2 190
# ℹ 33 more rows
# ℹ 3 more variables: body_mass_g <int>, sex <fct>, year <int>
anti_join()
# A tibble: 44 × 2
id name
<int> <chr>
1 1 Luke Skywalker
2 3 R2-D2
3 5 Leia Organa
4 7 Beru Whitesun Lars
5 9 Biggs Darklighter
6 11 Anakin Skywalker
7 13 Chewbacca
8 15 Greedo
9 17 Wedge Antilles
10 19 Yoda
# ℹ 34 more rows
tidyr
This package has 2 very very useful functions you must learn.
Have you seen this before? (A show of hands)
Each row represents a single character
# A tibble: 87 × 4
name height mass hair_color
<chr> <int> <dbl> <chr>
1 Luke Skywalker 172 77 blond
2 C-3PO 167 75 <NA>
3 R2-D2 96 32 <NA>
4 Darth Vader 202 136 none
5 Leia Organa 150 49 brown
6 Owen Lars 178 120 brown, grey
7 Beru Whitesun Lars 165 75 brown
8 R5-D4 97 32 <NA>
9 Biggs Darklighter 183 84 black
10 Obi-Wan Kenobi 182 77 auburn, white
# ℹ 77 more rows
Each row would represent a single attribute of a character.
# A tibble: 261 × 3
name attribute value
<chr> <chr> <chr>
1 Luke Skywalker height 172
2 Luke Skywalker mass 77
3 Luke Skywalker hair_color blond
4 C-3PO height 167
5 C-3PO mass 75
6 C-3PO hair_color <NA>
7 R2-D2 height 96
8 R2-D2 mass 32
9 R2-D2 hair_color <NA>
10 Darth Vader height 202
# ℹ 251 more rows
pivot_longer()
and pivot_wider()
pivot_longer()
“lengthens” data, increasing the number of rows and decreasing the number of columns. The inverse transformation is pivot_wider()
pivot_wider()
“widens” data, increasing the number of columns and decreasing the number of rows. The inverse transformation is pivot_longer().
pivot_longer()
Remember to look at the documentation here
# A tibble: 261 × 3
name attribute value
<chr> <chr> <chr>
1 Luke Skywalker height 172
2 Luke Skywalker mass 77
3 Luke Skywalker hair_color blond
4 C-3PO height 167
5 C-3PO mass 75
6 C-3PO hair_color <NA>
7 R2-D2 height 96
8 R2-D2 mass 32
9 R2-D2 hair_color <NA>
10 Darth Vader height 202
# ℹ 251 more rows
pivot_wider()
Also, look at the documentation here
# A tibble: 87 × 4
name height mass hair_color
<chr> <chr> <chr> <chr>
1 Luke Skywalker 172 77 blond
2 C-3PO 167 75 <NA>
3 R2-D2 96 32 <NA>
4 Darth Vader 202 136 none
5 Leia Organa 150 49 brown
6 Owen Lars 178 120 brown, grey
7 Beru Whitesun Lars 165 75 brown
8 R5-D4 97 32 <NA>
9 Biggs Darklighter 183 84 black
10 Obi-Wan Kenobi 182 77 auburn, white
# ℹ 77 more rows
ggplot
!!I used material from: Oliver Gimenez and L. Paloma Rojas.
Thank you!