Session 2 - ggplot2
June 5, 2024
© 2024 by Catalina Canizares, Francisco Cardozo, and Raymond Balise is licensed under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
I have 4 different data sets with 11 observations each and 2 variables:
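The description above matches Anscombe's quartet, which ships with base R as `datasets::anscombe`. That identification is an assumption, since the slide does not name the data. Under that assumption, a minimal sketch of why such data motivates plotting: the four sets share nearly identical summary statistics, so the numbers alone cannot tell them apart.

```r
# Assumption: the four data sets are Anscombe's quartet (base R `anscombe`),
# stored wide as columns x1..x4 and y1..y4 with 11 rows.
quartet <- datasets::anscombe

# Compute the same summaries for each of the four x/y pairs.
quartet_summary <- do.call(rbind, lapply(1:4, function(i) {
  x <- quartet[[paste0("x", i)]]
  y <- quartet[[paste0("y", i)]]
  data.frame(
    set    = i,
    n      = length(x),
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    cor_xy = cor(x, y)
  )
}))

print(round(quartet_summary, 2))
```

Only a scatterplot of each pair reveals how different the four sets really are.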
Uncovering Patterns: Data visualization helps in identifying and understanding complex patterns and relationships in data.
Simplifying the Complex: It transforms intricate data into a format that’s easier to grasp, effectively narrating a story.
Engaging Presentation: Visual representations are more appealing and engaging for the audience than tables of numbers.
Effective Communication: Enables clear and concise communication of information, facilitating better understanding and decision-making.
Proximity: Things that are spatially near to one another seem to be related.
Similarity: Things that look alike seem to be related.
Closure: Our brains tend to ignore gaps and complete structures with open areas.
Connection: Things that are visually tied to one another seem to be related.
Some objects in our visual field are easier to see than others.
Pop-out makes some things on a data graphic easier to see or find than others.
Find the blue circle:
Next Step: Perfecting Graph Construction
Graphical excellence is the well-designed presentation of interesting data (Tufte, 1983, p. 51).
What are the common sources from which adolescents obtain the alcohol they consume?
Perspective Distortion
Inconsistent Baselines
Visual Complexity
Distraction from the Data
What is the distribution of family-provided alcohol to adolescents across different racial groups?
Difficulty in Comparing Sections
Ineffective for Large Number of Categories
Reliance on Color or Patterns
Area Perception Issues
Difficulty in Reading Exact Values
This chart shows the average number of Facebook likes on posts by pages of the political left. The point of the chart was to show the disparity between Mr Corbyn’s posts and the others.
https://ig.ft.com/science-of-charts/
In the 1980s, Cleveland and McGill ran experiments in which participants estimated and compared values in charts.
The overall pattern of results seems clear: performance worsens substantially as we move from comparisons on a common scale to length-based comparisons, then to angles, and finally to areas.
These findings strongly suggest that there are better and worse ways of visually representing data when estimating and comparing values within the graph.
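To make the ranking concrete, here is a small sketch that draws the same values twice: once as bars (position on a common scale) and once as a pie (angles and areas). The data frame `shares` and its values are invented for illustration and do not come from the experiments described above.

```r
library(ggplot2)

# Hypothetical shares, invented for illustration only.
shares <- data.frame(
  category = c("A", "B", "C", "D", "E"),
  value    = c(23, 21, 20, 19, 17)
)

# Position on a common scale: all bars share one baseline.
bar_chart <- ggplot(shares, aes(x = category, y = value)) +
  geom_col() +
  labs(x = NULL, y = "Value", title = "Common scale (bars)")

# Angles and areas: the same values drawn as a pie chart.
pie_chart <- ggplot(shares, aes(x = "", y = value, fill = category)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(title = "Angles and areas (pie)")

print(bar_chart)
print(pie_chart)
```

With values this close together, ranking the categories should be noticeably easier from the bars than from the pie slices.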
🚫💭 Anything that makes a viewer need to think is bad!
📊 3D Charts: Misleading and Overcomplicated
🥧 Pie Charts: Hard to Compare Accurately
🍩 Donut Charts: Stylish, Yet Less Functional
📚 Stacked Graphics: Can Obscure Data Details
🖋️ Extra Ink: Clutters and Confuses
🔑 Key/Legend: Necessary but Keep It Simple
🔍 Include Sample Size in Title/Caption.
🗑️ Cut Extra Information.
🖌️ Remove Unnecessary Ink.
💡 Highlight Key Points.
🌈 Use Mnemonic Colors (Color-Blind Friendly).
🔖 Directly Label Categories.
📏 Think about the Range of the Y Axis
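Several items on this checklist can be sketched in a few lines of ggplot2. The example below uses the Okabe-Ito palette that ships with base R (version 4.0 or later) as one color-blind-friendly option, labels the lines directly instead of using a legend, and reports the sample size in the subtitle. The data frame `toy` and its values are hypothetical.

```r
library(ggplot2)

# Okabe-Ito is a color-blind-friendly palette available in base R (>= 4.0).
okabe_ito <- unname(grDevices::palette.colors(palette = "Okabe-Ito"))

# Hypothetical data, invented for illustration only.
toy <- data.frame(
  year  = rep(c(2017, 2019, 2021), times = 2),
  group = rep(c("Group A", "Group B"), each = 3),
  value = c(4.1, 4.6, 5.0, 3.2, 3.1, 3.5)
)

# Keep the last point of each line so categories can be labeled directly.
line_ends <- toy[toy$year == max(toy$year), ]

checklist_plot <- ggplot(toy, aes(x = year, y = value, color = group)) +
  geom_line() +
  geom_point() +
  # Direct labels replace the legend.
  geom_text(data = line_ends, aes(label = group), hjust = -0.1) +
  # Entries 2 and 6 of the palette are orange and blue.
  scale_color_manual(values = okabe_ito[c(2, 6)]) +
  # Leave room on the right for the labels.
  scale_x_continuous(
    breaks = unique(toy$year),
    expand = expansion(mult = c(0.05, 0.25))
  ) +
  labs(
    x = NULL,
    y = NULL,
    title = "Hypothetical trend for two groups",
    subtitle = paste0("N = ", nrow(toy), " (invented rows)")
  ) +
  theme_minimal() +
  theme(legend.position = "none")

print(checklist_plot)
```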
We consulted six distinct journals, extracting 23 random articles from sections highlighting the most cited, most downloaded, and most recently published works.
ggplot2 is an implementation of Leland Wilkinson’s Grammar of Graphics, a general scheme for data visualization.
Component | Function | Explanation |
---|---|---|
Data | ggplot(data) | The raw data that you want to visualize. |
Aesthetics | aes() | Links your data to how it’s shown on the graph. |
Geometries | geom_*() | The geometric shape of a layer representing the data. |
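The three components in the table can be seen working together in a minimal plot. The sketch below uses the built-in `mtcars` data frame purely as a stand-in; the choice of data set and columns is an assumption made for illustration.

```r
library(ggplot2)

# Data: the built-in `mtcars` data frame (any data frame would do).
components_plot <- ggplot(data = mtcars) +
  # Aesthetics: map columns to visual properties (x and y position).
  aes(x = wt, y = mpg) +
  # Geometries: draw one point per row of the data.
  geom_point()

print(components_plot)
```

Swapping `geom_point()` for another `geom_*()` changes the shape used to draw the data while the data and aesthetic mappings stay the same.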
ggplot2 Showcase
Where do tornado outbreaks usually occur?
Examine the average number of alcoholic drinks consumed by adolescents, differentiated by sex, for the years 2017, 2019, and 2021
RiskExplorer
youthAlcoholUse: 3385 rows x 12 cols
Column | Values | Missing | Mean | Median | SD |
---|---|---|---|---|---|
Year | 2017, 2019 and 2021 | 0.0% | — | — | — |
Age | | 0.8% | 16.5 | 17.0 | 1.2 |
Sex | Female and Male | 1.1% | — | — | — |
Grade | 12, 11, 10 and 9 | 1.2% | — | — | — |
Race | White, Multiple-Hispanic, Hispanic/Latino, Black or African American, Multiple-Non-Hispanic, Asian, Am Indian/Alaska Native and Native Hawaiian/Other PI | 1.6% | — | — | — |
SexOrientation | Heterosexual, Bisexual, Not sure and Gay or Lesbian | 6.3% | — | — | — |
TimesPhysicalFight | | 1.6% | 1.4 | 0.0 | 2.8 |
AgeFirstAlcohol | | 0.0% | 12.7 | 13.0 | 2.5 |
HowManyDaysAlcoholInMonth | | 0.0% | 8.1 | 5.0 | 7.6 |
SourceAlcohol | 6, 5, 8, 7, 2, 3 and 4 | 4.2% | — | — | — |
DaysOfBingeDrinking | | 0.0% | 4.5 | 2.0 | 5.1 |
LargestNumberOfDrinks | | 0.0% | 7.1 | 7.0 | 2.2 |
Year | mean |
---|---|
2017 | 8.205185 |
2019 | 7.824135 |
2021 | 8.263975 |
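library(dplyr)

# Assumes `youthAlcoholUse` is already loaded in the session.
# Summarise HowManyDaysAlcoholInMonth by year and sex, dropping rows
# with missing Sex. The `.by` argument needs dplyr 1.1.0 or later.
data_sex <-
  youthAlcoholUse |>
  filter(!is.na(Sex)) |>
  summarise(
    mean   = mean(HowManyDaysAlcoholInMonth),
    sd     = sd(HowManyDaysAlcoholInMonth),
    min    = min(HowManyDaysAlcoholInMonth),
    p25    = quantile(HowManyDaysAlcoholInMonth, .25),
    median = median(HowManyDaysAlcoholInMonth),
    p75    = quantile(HowManyDaysAlcoholInMonth, .75),
    max    = max(HowManyDaysAlcoholInMonth),
    .by = c(Year, Sex)
  )

# Show the summary as a table rounded to two significant figures.
data_sex |>
  gt::gt() |>
  gt::fmt_number(n_sigfig = 2)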
Year | Sex | mean | sd | min | p25 | median | p75 | max |
---|---|---|---|---|---|---|---|---|
2017 | Female | 7.4 | 6.8 | 2.0 | 2.0 | 5.0 | 9.0 | 30 |
2017 | Male | 9.2 | 8.3 | 2.0 | 2.0 | 5.0 | 9.0 | 30 |
2019 | Male | 9.3 | 8.5 | 2.0 | 2.0 | 5.0 | 9.0 | 30 |
2019 | Female | 6.6 | 6.4 | 2.0 | 2.0 | 5.0 | 9.0 | 30 |
2021 | Male | 9.3 | 8.9 | 2.0 | 2.0 | 5.0 | 9.0 | 30 |
2021 | Female | 7.4 | 6.6 | 2.0 | 2.0 | 5.0 | 9.0 | 30 |
🌟 What Gestalt Principle did we use here?
library(ggplot2)

# First version: points joined by lines, one color per sex, with a legend.
data_sex |>
  ggplot(aes(x = Year, y = mean, color = Sex, group = Sex)) +
  geom_point() +
  geom_line() +
  labs(
    x = "",
    y = "",
    title = "Average Number of Alcoholic Drinks Consumed by\nBoys and Girls",
    subtitle = "N = 3,348",
    caption = "`RiskExplorer` R package"
  ) +
  theme_minimal() +
  # Named values keep each color tied to the right category.
  scale_color_manual(values = c(Female = "#c90076", Male = "#2986cc"))
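library(ggplot2)
library(ggrepel)       # geom_text_repel()
library(geomtextpath)  # geom_textline()

# Second version: label the values and the lines directly, then drop the legend.
data_sex |>
  ggplot(aes(x = Year, y = mean, color = Sex, group = Sex)) +
  geom_point() +
  labs(
    x = "",
    y = "",
    title = "Average Number of Alcoholic Drinks Consumed by\nBoys and Girls",
    subtitle = "N = 3,348",
    caption = "`RiskExplorer` R package"
  ) +
  theme_minimal() +
  scale_color_manual(values = c(Female = "#c90076", Male = "#2986cc")) +
  # Print each mean next to its point without overlapping other labels.
  geom_text_repel(aes(label = round(mean, 1))) +
  # Draw each line with its category name written along it.
  geom_textline(aes(label = Sex), size = 5, hjust = 0.2) +
  theme(
    legend.position = "none",
    panel.grid.major = element_blank(),
    plot.title.position = "plot"
  )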
library(ggplot2)
library(ggrepel)       # geom_text_repel()
library(geomtextpath)  # geom_textline()

# Third version: same plot, with the y-axis range set explicitly.
data_sex |>
  ggplot(aes(x = Year, y = mean, color = Sex, group = Sex)) +
  geom_point() +
  labs(
    x = "",
    y = "",
    title = "Average Number of Alcoholic Drinks Consumed by\nBoys and Girls",
    subtitle = "N = 3,348",
    caption = "`RiskExplorer` R package"
  ) +
  theme_minimal() +
  scale_color_manual(values = c(Female = "#c90076", Male = "#2986cc")) +
  geom_text_repel(aes(label = round(mean, 1))) +
  geom_textline(aes(label = Sex), size = 5, hjust = 0.2) +
  theme(
    legend.position = "none",
    panel.grid.major = element_blank(),
    plot.title.position = "plot"
  ) +
  # Widen the y axis so small differences are not visually exaggerated.
  scale_y_continuous(limits = c(2, 10))
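One detail worth keeping in mind when setting the y range: by default, `scale_y_continuous(limits = ...)` treats values outside the limits as missing before drawing, whereas `coord_cartesian(ylim = ...)` only zooms the view. In the plot above every group mean lies inside 2 to 10, so nothing is lost. The sketch below uses a hypothetical data frame, `toy_means`, with one value outside the range to show the difference.

```r
library(ggplot2)

# Hypothetical means, invented for illustration; 11.2 lies outside 2..10.
toy_means <- data.frame(
  year = factor(c(2017, 2019, 2021)),
  mean = c(7.4, 6.6, 11.2)
)

base_plot <- ggplot(toy_means, aes(x = year, y = mean, group = 1)) +
  geom_point() +
  geom_line()

# Scale limits: the default out-of-bounds handling turns 11.2 into NA,
# so that point and the line segment leading to it are not drawn.
limited <- base_plot + scale_y_continuous(limits = c(2, 10))

# Coordinate limits: all data are kept and the view is cropped to 2..10,
# so the line still heads toward the off-screen point.
zoomed <- base_plot + coord_cartesian(ylim = c(2, 10))

print(limited)  # expect a warning about removed or missing values
print(zoomed)
```

When the goal is only to change what the viewer sees, `coord_cartesian()` is usually the safer choice.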
Catalina Cañizares ccani007@fiu.edu
Francisco Cardozo foc9@miami.edu
Raymond Balise balise@miami.edu