Session 2 - ggplot2

Catalina Cañizares PhD
Francisco Cardozo MSc
Raymond Balise PhD

June 5, 2024

About This Material

Session 2 - ggplot2 © 2024 by Catalina Cañizares, Francisco Cardozo, and Raymond Balise is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International.

This material is freely available under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

For more information on this license, please visit: Creative Commons License

For whom is this workshop?

Cat People

Dog People

Agenda

Why should we care?

I have 4 different data sets with 11 observations each and 2 variables:

Table



Let's look at them:
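
The description above (four data sets, 11 observations each, two variables) matches Anscombe's quartet, which ships with base R as `anscombe`. Assuming that is the data shown here, a minimal sketch of why a summary table alone can hide the story:

```r
# Assumption: the four data sets are Anscombe's quartet (base R `anscombe`).
quartet <- datasets::anscombe

# Summarise each x/y pair: the numbers come out nearly identical.
summary_stats <- data.frame(
  set    = 1:4,
  mean_x = sapply(1:4, function(i) mean(quartet[[paste0("x", i)]])),
  mean_y = sapply(1:4, function(i) mean(quartet[[paste0("y", i)]])),
  cor_xy = sapply(1:4, function(i) {
    cor(quartet[[paste0("x", i)]], quartet[[paste0("y", i)]])
  })
)

print(round(summary_stats, 2))
```

Plotting each pair, for example with `plot(quartet$x1, quartet$y1)`, is what reveals how different the four sets really are.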

Importance of Data Viz

  • Uncovering Patterns: Data visualization helps in identifying and understanding complex patterns and relationships in data.

  • Simplifying the Complex: It transforms intricate data into a format that’s easier to grasp, effectively narrating a story.

  • Engaging Presentation: Visual representations are more visually appealing and engaging for the audience.

  • Effective Communication: Enables clear and concise communication of information, facilitating better understanding and decision-making.

Gestalt

  • Gestalt psychology is a school of thought dedicated to understanding the human brain’s perception and interpretation of experiences.
  • Researchers in this field have identified various laws of perception or principles, which are based on our inherent tendency to find order in chaos.

The principles applied to data visualization

Proximity

Things that are spatially near to one another seem to be related.

Similarity

Things that look alike seem to be related.

Closure

Our brains tend to ignore gaps and complete structures that have open areas.

Connection

Things that are visually tied to one another seem to be related.

Preattentive Processing or Pop-out

Some objects in our visual field are easier to see than others.

Pop-out makes some things on a data graphic easier to see or find than others.

Find the blue circle:

Preattentive Processing or Pop-out

Foundations Set, Let’s Navigate Further!

🌟 What We’ve Learned:

  • Why We Should Care.
  • Leveraging Our Brain’s Perception to Make Clearer Graphs.

Next Step: Perfecting Graph Construction

Crafting Clear and Compelling Graphs

The Musts in a Graph

  • Ease of Interpretation: Do not require the reader to think unnecessarily hard.
  • Cognitive Comfort: Lighten the cognitive load when processing.
  • Honest Scaling: Avoid misleading through y-axis manipulation.

Graphical excellence is the well-designed presentation of interesting data (Tufte, 1983, p. 51).

Ease of Interpretation

What are the common sources from which adolescents obtain the alcohol they consume?

Ease of Interpretation

  • Perspective Distortion

  • Inconsistent Baselines

  • Visual Complexity

  • Distraction from the Data

Cognitive Comfort

What is the distribution of family-provided alcohol to adolescents across different racial groups?

Cognitive Comfort

  • Difficulty in Comparing Sections

  • Ineffective for Large Number of Categories

  • Reliance on Color or Patterns

  • Area Perception Issues

  • Difficulty in Reading Exact Values

Honest Scaling

This chart shows the average number of Facebook likes on posts by pages of the political left. The point of this chart was to show the disparity between Mr Corbyn’s posts and others.
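
To make the idea of honest scaling concrete, here is a minimal sketch that draws the same bars with and without a zero baseline. The numbers are invented for illustration and are not the Facebook data from the chart above; `likes`, `honest`, and `truncated` are hypothetical names.

```r
library(ggplot2)

# Hypothetical values, invented for illustration only.
likes <- data.frame(
  page  = c("Page A", "Page B", "Page C"),
  value = c(52, 55, 58)
)

base_plot <- ggplot(likes, aes(x = page, y = value)) +
  geom_col()

# Honest: bars start at zero, so bar length is proportional to the value.
honest <- base_plot + coord_cartesian(ylim = c(0, 60))

# Misleading: zooming in to 50-60 makes Page C look four times taller
# than Page A, although the values differ by about 12%.
truncated <- base_plot + coord_cartesian(ylim = c(50, 60))
```

Printing `honest` and `truncated` side by side shows how much the visible difference depends on where the axis starts.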

Quick Experiment

https://ig.ft.com/science-of-charts/

These results replicate

In the 1980s, Cleveland and McGill ran experiments in which participants estimated and compared values in charts.

The results of Cleveland and McGill

The overall pattern of results seems clear: performance worsens substantially as we move away from comparison on a common scale to length-based comparisons, then to angles, and finally to areas.

In conclusion

These findings strongly suggest that there are better and worse ways of visually representing data when estimating and comparing values within the graph.
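
As a rough illustration of that ranking, the sketch below encodes the same four values twice: once as position on a common scale (bars) and once as angles and areas (a pie). The data are invented for illustration, and `shares`, `bars`, and `pie` are hypothetical names.

```r
library(ggplot2)

# Hypothetical shares, invented for illustration only.
shares <- data.frame(
  group = c("A", "B", "C", "D"),
  share = c(23, 25, 27, 25)
)

# Position on a common scale: small differences are easy to rank.
bars <- ggplot(shares, aes(x = group, y = share)) +
  geom_col()

# Angle and area encoding of the same values:
# a single stacked bar bent into a circle.
pie <- ggplot(shares, aes(x = "", y = share, fill = group)) +
  geom_col(width = 1) +
  coord_polar(theta = "y")
```

In `bars` it is easy to see that C is largest and A is smallest; in `pie` the same comparison takes noticeably more effort.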

Key Takeaways:

🚫💭 Anything that makes a viewer need to think is bad!

📊 3D Charts: Misleading and Overcomplicated

🥧 Pie Charts: Hard to Compare Accurately

🍩 Donut Charts: Stylish, Yet Less Functional

📚 Stacked Graphics: Can Obscure Data Details

🖋️ Extra Ink: Clutters and Confuses

🔑 Key/Legend: Necessary but Keep It Simple

Applying the Takeaways

Designing with Precision: Checklist

🔍 Include Sample Size in Title/Caption.

🗑️ Cut Extra Information.

🖌️ Remove Unnecessary Ink.

💡 Highlight Key Points.

🌈 Use Mnemonic Colors (Color-Blind Friendly).

🔖 Directly Label Categories.

📏 Think about the Range of the Y Axis

Data Visualization Common in Social Work

We consulted six distinct journals, extracting 23 random articles from sections highlighting the most cited, most downloaded, and most recently published works.

We found

ggplot2

ggplot2

  • On June 10, 2007, Hadley Wickham officially released ggplot2
  • An open-source data visualization package for R.
  • ggplot2 is an implementation of Leland Wilkinson’s Grammar of Graphics—a general scheme for data visualization
  • It is a system for declaratively creating graphics.

The components of a ggplot

Component    Function      Explanation
Data         ggplot(data)  The raw data that you want to visualize.
Aesthetics   aes()         Links your data to how it’s shown on the graph.
Geometries   geom_*()      The geometric shape of a layer representing the data.
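
A minimal sketch of the three components working together, using the `mpg` data set that ships with ggplot2 (chosen here only for illustration; it is not the workshop data):

```r
library(ggplot2)

p <- ggplot(data = mpg) +      # Data: the data frame to visualize
  aes(x = displ, y = hwy) +    # Aesthetics: map columns to x and y
  geom_point()                 # Geometry: draw one point per row

p
```

Swapping `geom_point()` for another `geom_*()` changes the shape of the layer while the data and aesthetics stay the same.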

Advantages of ggplot

  • Active and helpful community.
  • Very flexible: you can build a plot specification layer by layer.
  • You can create themes for desired appearance.
  • Reproducibility

The ggplot2 Showcase

Ariane Aumaitre - Reference

The ggplot2 Showcase

Recreating the New York Times COVID-19 Spiral Graph

The ggplot2 Showcase

Recreating the Economist Barplot

The ggplot2 Showcase

Where do tornado outbreaks usually occur?

Reference - Tanya Shapiro

A Navigation-Through Example

We are here

Our Mission

Examine the average number of alcoholic drinks consumed by adolescents, differentiated by sex, for the years 2017, 2019, and 2021

To Follow The Code

Or click here

The Data Set

  • Source: Youth Risk Behavioral Surveillance System 2017, 2019, and 2021
  • Variables: Alcohol use in adolescents
  • R package: RiskExplorer
# install.packages("tidyverse")
# install.packages("devtools")
# devtools::install_github("ccani007/dissertationData")

library(tidyverse)
library(RiskExplorer)

data("youthAlcoholUse")

The Data Set

gtExtras::gt_plt_summary(youthAlcoholUse)
youthAlcoholUse: 3385 rows x 12 cols

Column                     Missing  Mean  Median  SD   Overview
Year                       0.0%                        3 categories: 2017, 2019 and 2021
Age                        0.8%     16.5  17.0    1.2  Range 12 to 18
Sex                        1.1%                        2 categories: Female and Male
Grade                      1.2%                        4 categories: 12, 11, 10 and 9
Race                       1.6%                        8 categories: White, Multiple-Hispanic, Hispanic/Latino, Black or African American, Multiple-Non-Hispanic, Asian, Am Indian/Alaska Native and Native Hawaiian/Other PI
SexOrientation             6.3%                        4 categories: Heterosexual, Bisexual, Not sure and Gay or Lesbian
TimesPhysicalFight         1.6%     1.4   0.0     2.8  Range 0 to 12
AgeFirstAlcohol            0.0%     12.7  13.0    2.5  Range 8 to 17
HowManyDaysAlcoholInMonth  0.0%     8.1   5.0     7.6  Range 2 to 30
SourceAlcohol              4.2%                        7 categories: 6, 5, 8, 7, 2, 3 and 4
DaysOfBingeDrinking        0.0%     4.5   2.0     5.1  Range 1 to 20
LargestNumberOfDrinks      0.0%     7.1   7.0     2.2  Range 4 to 10

Organizing the Data

Examine the average number of alcoholic drinks consumed by adolescents, differentiated by sex, for the years 2017, 2019, and 2021

data <- 
  youthAlcoholUse |> 
  summarise(mean = mean(HowManyDaysAlcoholInMonth), .by = c(Year)) 

data |> 
  gt::gt()
Year mean
2017 8.205185
2019 7.824135
2021 8.263975

First Attempt

Examine the average number of alcoholic drinks consumed by adolescents, differentiated by sex, for the years 2017, 2019, and 2021

# Component 1: The data
data |>
  ggplot(
    # Component 2: Aesthetics
    aes(x = Year, y = mean)
  ) +
  # Component 3: Geometries
  geom_bar(stat = "identity")
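
When the bar heights are already computed, `geom_col()` is ggplot2's shorthand for `geom_bar(stat = "identity")`. A small sketch, using the yearly means from the table above (rounded) and assuming `Year` is stored as a categorical variable; `yearly`, `with_bar`, and `with_col` are hypothetical names:

```r
library(ggplot2)

# Yearly means copied from the table above, rounded to two decimals.
yearly <- data.frame(
  Year = factor(c(2017, 2019, 2021)),
  mean = c(8.21, 7.82, 8.26)
)

# Both calls use the values in `mean` directly as bar heights.
with_bar <- ggplot(yearly, aes(x = Year, y = mean)) +
  geom_bar(stat = "identity")

with_col <- ggplot(yearly, aes(x = Year, y = mean)) +
  geom_col()
```

By default `geom_bar()` counts rows instead, which is why `stat = "identity"` is needed in the original call.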

Organizing the Data II

Examine the average number of alcoholic drinks consumed by adolescents, differentiated by sex, for the years 2017, 2019, and 2021

data_sex <- 
  youthAlcoholUse |> 
  filter(!is.na(Sex)) |> 
  summarise(mean = mean(HowManyDaysAlcoholInMonth),
            sd = sd(HowManyDaysAlcoholInMonth), 
            min = min(HowManyDaysAlcoholInMonth),
            p25 = quantile(HowManyDaysAlcoholInMonth, .25),
            median = median(HowManyDaysAlcoholInMonth), 
            p75 = quantile(HowManyDaysAlcoholInMonth, .75), 
            max = max(HowManyDaysAlcoholInMonth),
            .by = c(Year, Sex)) 
  

data_sex |> 
  gt::gt() |> 
  # fmt_number() lives in gt, which is not attached, so call it with gt::
  gt::fmt_number(n_sigfig = 2)
Year Sex mean sd min p25 median p75 max
2017 Female 7.4 6.8 2.0 2.0 5.0 9.0 30
2017 Male 9.2 8.3 2.0 2.0 5.0 9.0 30
2019 Male 9.3 8.5 2.0 2.0 5.0 9.0 30
2019 Female 6.6 6.4 2.0 2.0 5.0 9.0 30
2021 Male 9.3 8.9 2.0 2.0 5.0 9.0 30
2021 Female 7.4 6.6 2.0 2.0 5.0 9.0 30

Second Attempt

Examine the average number of alcoholic drinks consumed by adolescents, differentiated by sex, for the years 2017, 2019, and 2021

data_sex |> 
  ggplot(aes(x = Year, y = mean, fill = Sex, group = Sex)) +
  geom_bar(stat = "identity")

Third Attempt

Examine the average number of alcoholic drinks consumed by adolescents, differentiated by sex, for the years 2017, 2019, and 2021

data_sex |> 
  ggplot(aes(x = Year, y = mean, fill = Sex, group = Sex)) +
  geom_bar(stat = "identity", position = "dodge")
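
To see what `position = "dodge"` changes, the sketch below rebuilds the two versions from the rounded means in the table above and compares the tallest bar in each. The names `by_sex`, `stacked`, and `dodged` are hypothetical.

```r
library(ggplot2)

# Rounded means copied from the Year-by-Sex table above.
by_sex <- data.frame(
  Year = factor(rep(c(2017, 2019, 2021), each = 2)),
  Sex  = rep(c("Female", "Male"), times = 3),
  mean = c(7.4, 9.2, 6.6, 9.3, 7.4, 9.3)
)

# Default position is "stack": the two means are piled on top of each other.
stacked <- ggplot(by_sex, aes(x = Year, y = mean, fill = Sex)) +
  geom_col()

# "dodge" places the bars side by side, each starting at zero.
dodged <- ggplot(by_sex, aes(x = Year, y = mean, fill = Sex)) +
  geom_col(position = "dodge")

max(layer_data(stacked)$ymax)  # sum of two means within one year
max(layer_data(dodged)$ymax)   # largest single mean
```

A stacked bar of means adds two averages together, a total with no useful interpretation, which is why the dodged version reads better here.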

Fourth Attempt

Examine the average number of alcoholic drinks consumed by adolescents, differentiated by sex, for the years 2017, 2019, and 2021

data_sex |> 
  ggplot(aes(x = Year, y = mean, color = Sex, group = Sex)) +
# We changed the geom!
  geom_point() 

Fifth Attempt

Examine the average number of alcoholic drinks consumed by adolescents, differentiated by sex, for the years 2017, 2019, and 2021

data_sex |> 
  ggplot(aes(x = Year, y = mean, color = Sex, group = Sex)) +
  geom_point() +
  geom_line()
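
The `group = Sex` aesthetic matters for `geom_line()`. A minimal sketch of why, assuming `Year` is categorical and using the rounded means from the table above (`by_sex`, `no_group`, and `with_group` are hypothetical names):

```r
library(ggplot2)

# Rounded means copied from the Year-by-Sex table above.
by_sex <- data.frame(
  Year = factor(rep(c(2017, 2019, 2021), each = 2)),
  Sex  = rep(c("Female", "Male"), times = 3),
  mean = c(7.4, 9.2, 6.6, 9.3, 7.4, 9.3)
)

# With a discrete x and a discrete colour, ggplot2 groups by every
# Year-by-Sex combination: six groups of one point, so no line can be drawn.
no_group <- ggplot(by_sex, aes(x = Year, y = mean, color = Sex)) +
  geom_line()

# group = Sex overrides the default: two groups of three points each.
with_group <- ggplot(by_sex, aes(x = Year, y = mean, color = Sex, group = Sex)) +
  geom_line()

length(unique(layer_data(no_group)$group))
length(unique(layer_data(with_group)$group))
```

Printing `no_group` produces an empty panel with a message about groups of one observation; `with_group` draws one line per sex.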

🌟 What Gestalt Principle did we use here?

What are we missing from our checklist?

🔍 Include Sample Size in Title/Caption.

# library(ggrepel)
# library(geomtextpath)

data_sex |>
  ggplot(aes(x = Year, y = mean, color = Sex, group = Sex)) +
  geom_point() +
  geom_line() +
  labs(
    title = "Average Number of Alcoholic Drinks Consumed by \n Boys and Girls",
    subtitle = "N = 3,348",
    caption = "`RiskExplorer` R package"
  )
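
Rather than typing the sample size by hand, the subtitle can be computed from the data. The sketch below uses a made-up stand-in for `youthAlcoholUse` with 3,385 rows, 37 of them missing `Sex`; the count of 37 is an assumption chosen to be consistent with the 1.1% missingness and the N = 3,348 shown above. `toy`, `n_used`, and `subtitle_text` are hypothetical names.

```r
# Hypothetical stand-in for youthAlcoholUse: 3,385 rows, 37 with Sex missing.
toy <- data.frame(
  Sex = c(rep(c("Female", "Male"), length.out = 3348), rep(NA, 37))
)

# Count the rows that survive filter(!is.na(Sex)).
n_used <- sum(!is.na(toy$Sex))

# Format with a thousands separator for the subtitle.
subtitle_text <- paste0("N = ", format(n_used, big.mark = ","))
subtitle_text
```

The resulting string can be passed to `labs(subtitle = subtitle_text)`, so the label stays correct if the filtering changes.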

🗑️ Cut Extra Information

data_sex |>
  ggplot(aes(x = Year, y = mean, color = Sex, group = Sex)) +
  geom_point() +
  geom_line() +
  labs(
    x = "",
    y = "",
    title = "Average Number of Alcoholic Drinks Consumed by \n Boys and Girls",
    subtitle = "N = 3,348",
    caption = "`RiskExplorer` R package"
  )

🖌️ Remove Unnecessary Ink

data_sex |>
  ggplot(aes(x = Year, y = mean, color = Sex, group = Sex)) +
  geom_point() +
  geom_line() +
  labs(
    x = "",
    y = "",
    title = "Average Number of Alcoholic Drinks Consumed by \n Boys and Girls",
    subtitle = "N = 3,348",
    caption = "`RiskExplorer` R package"
  ) +
  theme_minimal()

🌈 Use Mnemonic Colors (Color-Blind Friendly).

data_sex |>
  ggplot(aes(x = Year, y = mean, color = Sex, group = Sex)) +
  geom_point() +
  geom_line() +
  labs(
    x = "",
    y = "",
    title = "Average Number of Alcoholic Drinks Consumed by \n Boys and Girls",
    subtitle = "N = 3,348",
    caption = "`RiskExplorer` R package"
  ) +
  theme_minimal() +
  scale_color_manual(values = c(Female = "#c90076", Male = "#2986cc"))
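
If there is doubt about whether two hand-picked hex codes are distinguishable for color-blind readers, one option is to draw them from a palette designed for that purpose. The sketch below uses the Okabe-Ito palette that ships with base R (version 4.0.0 or later); the particular color choices and the name `sex_colors` are illustrative assumptions, not the authors' palette.

```r
# Okabe-Ito is a palette designed to stay distinguishable under common
# forms of colour-vision deficiency; it ships with base R (>= 4.0.0).
okabe_ito <- grDevices::palette.colors(palette = "Okabe-Ito")

# Hypothetical alternative mapping for the two groups in the plot.
sex_colors <- c(
  Female = unname(okabe_ito["reddishpurple"]),
  Male   = unname(okabe_ito["blue"])
)

sex_colors
```

The named vector can be used directly as `scale_color_manual(values = sex_colors)`.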

🔖 Directly Label Categories.

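# geom_text_repel() comes from ggrepel; geom_textline() comes from geomtextpath
library(ggrepel)
library(geomtextpath)

data_sex |>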
  ggplot(aes(x = Year, y = mean, color = Sex, group = Sex)) +
  geom_point() +
  labs(
    x = "",
    y = "",
    title = "Average Number of Alcoholic Drinks Consumed by \n Boys and Girls",
    subtitle = "N = 3,348",
    caption = "`RiskExplorer` R package"
  ) +
  theme_minimal() +
  scale_color_manual(values = c(Female = "#c90076", Male = "#2986cc")) +
  geom_text_repel(aes(label = round(mean, 1))) +
  geom_textline(aes(label = Sex), size = 5, hjust = 0.2) +
  theme(
    legend.position = "none",
    panel.grid.major = element_blank(),
    plot.title.position = "plot"
  )
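
If the extra packages are not available, a simpler form of direct labeling can be sketched with ggplot2 alone by placing one text label at the last year of each line. This is an alternative approach, not the one used in the slide; `by_sex`, `last_points`, and `labelled` are hypothetical names, and the values are the rounded means from the table above.

```r
library(ggplot2)

# Rounded means copied from the Year-by-Sex table above.
by_sex <- data.frame(
  Year = factor(rep(c(2017, 2019, 2021), each = 2)),
  Sex  = rep(c("Female", "Male"), times = 3),
  mean = c(7.4, 9.2, 6.6, 9.3, 7.4, 9.3)
)

# Keep only the final year: one row per line to carry the label.
last_points <- by_sex[by_sex$Year == "2021", ]

labelled <- ggplot(by_sex, aes(x = Year, y = mean, color = Sex, group = Sex)) +
  geom_line() +
  geom_point() +
  # Label each line once, just to the right of its last point.
  geom_text(data = last_points, aes(label = Sex), hjust = -0.2) +
  # Leave extra room on the right so the labels are not cut off.
  scale_x_discrete(expand = expansion(mult = c(0.05, 0.3))) +
  theme_minimal() +
  theme(legend.position = "none")
```

With the labels on the lines, the legend becomes redundant and can be removed, as in the slide.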

📏 Think about the Range of the Y Axis

data_sex |>
  ggplot(aes(x = Year, y = mean, color = Sex, group = Sex)) +
  geom_point() +
  labs(
    x = "",
    y = "",
    title = "Average Number of Alcoholic Drinks Consumed by \n Boys and Girls",
    subtitle = "N = 3,348",
    caption = "`RiskExplorer` R package"
  ) +
  theme_minimal() +
  scale_color_manual(values = c(Female = "#c90076", Male = "#2986cc")) +
  geom_text_repel(aes(label = round(mean, 1))) +
  geom_textline(aes(label = Sex), size = 5, hjust = 0.2) +
  theme(
    legend.position = "none",
    panel.grid.major = element_blank(),
    plot.title.position = "plot"
  ) +
  scale_y_continuous(limits = c(2, 10))
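
One thing to keep in mind when setting the range: limits on the scale and limits on the coordinate system behave differently. The sketch below uses invented values, with one point deliberately outside the 2 to 10 range; `toy`, `clipped`, and `zoomed` are hypothetical names.

```r
library(ggplot2)

# Hypothetical values; 12 lies outside the 2-10 range on purpose.
toy <- data.frame(x = c("a", "b", "c"), y = c(4, 8, 12))

base_plot <- ggplot(toy, aes(x = x, y = y)) +
  geom_point()

# Scale limits: out-of-range values are set to NA and dropped with a warning.
clipped <- base_plot + scale_y_continuous(limits = c(2, 10))

# Coordinate limits: all data are kept; only the visible window changes.
zoomed <- base_plot + coord_cartesian(ylim = c(2, 10))

sum(!is.na(layer_data(clipped)$y))  # points that survive the scale limits
sum(!is.na(layer_data(zoomed)$y))   # points that survive the zoom
```

In the plot above all means fall between 2 and 10, so nothing is lost, but with summaries computed after plotting (for example smoothers or boxplots) the two approaches can give different pictures.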

The Final Plot

Resources:

Books

We’ve navigated through treacherous waters and found our treasure. Now, let’s set sail to new adventures. Until we meet again on the high seas!

Your Captains






Catalina Cañizares ccani007@fiu.edu

Francisco Cardozo foc9@miami.edu

Raymond Balise balise@miami.edu