Data Visualization

Behavioral Data Science Toolbox 2025

Lecturers: Annie Johansson and Lilian Ye

Slides made using quarto

Cover photo made in R by Danielle Navarro

📌

data-visualization-2025

Tuesday, Sept 30

  • Why visualize data?
  • What makes a plot good (and bad)?
  • Guiding principles
  • Data Visualization Project

Why visualize data? 👀

💬

Why visualize data?

Anscombe’s quartet

Sources: Same stats, different graphs; Wikipedia
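
The point of the quartet can also be checked numerically. Below is a minimal sketch, assuming the anscombe data frame that ships with base R (columns x1 to x4 and y1 to y4); the object name quartet_stats is made up for illustration.

```r
# Summary statistics for each of the four x/y pairs in Anscombe's quartet.
# `anscombe` ships with base R (package datasets).
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    set    = i,
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x   = sd(x),
    sd_y   = sd(y),
    cor_xy = cor(x, y)
  )
}))

# The rows are nearly identical, although the scatterplots look nothing alike.
print(round(quartet_stats, 2))
```

Plotting each pair (for example with plot(anscombe$x1, anscombe$y1)) shows what the table hides.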

Code
library("datasauRus")
library("dplyr")    # provides %>%
library("ggplot2")  # aes() and element_text() are called without a prefix below

datasaurus_dozen %>% 
  ggplot2::ggplot(aes(x = x, y = y, color = dataset)) +
  ggplot2::geom_point() +
  ggplot2::theme_void() +
  ggplot2::geom_smooth(method = "lm", color = "gray", fill = "gray", alpha = .5) +
  ggplot2::theme(legend.position = "none", text = element_text(size = 30)) +
  ggplot2::facet_wrap(~dataset, ncol = 4)

💬

What is more important? An eye or an algorithm?

What are the consequences of overrelying on statistical techniques?
What are the consequences of overrelying on visualizations?

Bad plots 💩

What makes a bad plot bad?

  • Aesthetic (ugly)

  • Perceptual (bad)

  • Substantive (wrong)

💬 Speed dates

  • What do you think about this plot?

  • What elements can be improved?

  • Are the problems aesthetic, perceptual, or substantive?

💬

reddit.com/r/dataisugly

  • Misleading (or missing) information?
  • Many data visualizations carry a hidden agenda, for example to serve a marketing strategy.

💬

Data to ink ratio

💬

  • Do the colors make sense?

💬

  • Are all elements needed?

💬

  • Are the axes correct?

Is a zero-point needed?

💬

  • Is it simple?

⚠️ Stacked bar charts ⚠️

⚠️ Pie charts ⚠️

Visualizing proportions

  • Pie chart, stacked bar, or side-by-side bars?

https://clauswilke.com/dataviz/visualizing-proportions.html#
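
To compare the three options on the same numbers, here is a rough sketch in ggplot2. The shares data frame is invented for illustration and is not taken from the slides or from the linked chapter.

```r
library(ggplot2)

# Made-up shares, for illustration only
shares <- data.frame(
  group = c("A", "B", "C", "D"),
  share = c(0.40, 0.30, 0.20, 0.10)
)

# Pie chart: a single stacked bar drawn in polar coordinates
p_pie <- ggplot(shares, aes(x = "", y = share, fill = group)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

# Stacked bar: the same single bar, left in Cartesian coordinates
p_stacked <- ggplot(shares, aes(x = "", y = share, fill = group)) +
  geom_col(width = 0.5) +
  labs(x = NULL, y = "Share")

# Side-by-side bars: every group starts at zero, so lengths are easy to compare
p_side <- ggplot(shares, aes(x = group, y = share)) +
  geom_col() +
  labs(x = NULL, y = "Share")
```

Print p_pie, p_stacked, and p_side one after the other and ask which one lets you rank the groups fastest.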

Elementary Perceptual Tasks

💬

  • Does it support one conclusion?

Storytelling with data - show just one conclusion!

Exploratory versus explanatory

Source: Scott McCloud

  • Exploratory: examine the structure of your data.

  • Explanatory: tell a story with your data.

Exploratory versus explanatory

Source: storytellingwithdata.com

Some good examples

Distributions are informative

NYT Graphic - Obamacare spending

How to reproduce a NYT graphic

Guiding principles 🪄

Tips for the best viz

  • Is it explaining data?

  • Is the information complete and correct?

  • Are axes correct? (+ Should they have a zero-point?)

  • Do the colors work? ( + Do they map to a relevant attribute?)

  • Are all elements needed?

  • What is the data to ink ratio?

  • Is it understandable & simple?

  • Does it portray one conclusion?
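
Several of these checks translate directly into ggplot2 settings. The following is only a sketch on the built-in mtcars data, not a recipe; the object names avg_mpg and p_clean are made up for illustration.

```r
library(ggplot2)

# mtcars ships with base R; average miles per gallon per cylinder count
avg_mpg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
avg_mpg$cyl <- factor(avg_mpg$cyl)

p_clean <- ggplot(avg_mpg, aes(x = cyl, y = mpg)) +
  # one neutral fill: color would not map to any attribute here
  geom_col(fill = "gray40", width = 0.6) +
  # bars encode length, so the axis keeps its zero point
  scale_y_continuous(limits = c(0, NA),
                     expand = expansion(mult = c(0, 0.05))) +
  # the title states the one conclusion
  labs(title = "Cars with fewer cylinders use less fuel",
       x = "Number of cylinders", y = "Miles per gallon (mean)") +
  theme_minimal() +
  # drop ink that carries no data
  theme(panel.grid.major.x = element_blank(),
        panel.grid.minor = element_blank())
```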

Break

Code
library("RXKCD")
RXKCD::getXKCD(which = "833")

Data Visualization Project 📊

Aims

Your goal is to go from exploring the Prowise data to telling a story with it.

  • Assignment 1: Choose a research question. Explore the data and create a set of visualizations to help you understand the data.

  • Assignment 2: Create a final data visualization that tells a story with the data, compiled into an HTML document and presented in a GitHub repository.

Requirements Assignment 2

  • Use R Markdown (or Quarto) to create your visualizations.
  • The markdown document should be in the style of a report: explain your research question, and how you came to your conclusion. Not more than 1000 words.
  • Include an exploratory graph, and explain how it helped you understand the data.
  • Only your final, explanatory graph will be graded. Clearly mark which graph this is.

Workflow

  • Comment your code clearly. We want to understand your thought process.
  • Collaborate within your group using Git & GitHub. You will set up your own project repository connected to the Data Visualization Server.
  • Make sure all documents render correctly, and that code is styled and runs.
  • ⚠️ Do not push any data to GitHub.

Workflow

First draft & peer review

  • First draft due Friday 10/10 12.30 – No requirements.

  • Peer review due Friday 10/10 23.59 – Review one other group’s work. You will have time for this during Friday’s class.

🔗

Choose RQ

Interactive plots 🕹️

Friday, Oct 3

  • Interactive visualizations with plotly
  • Work on assignment 1

<<<<<<< HEAD

Some content

=======

Some conflicted content

>>>>>>> new_branch

Git conflicts

  • Happens when two people change the same file, and then try to push to GitHub.
  • Identify the content that you want to keep and delete the rest
  • Delete the <<<<<<<, =======, and >>>>>>> lines
  • Save the file, add, commit, and push again
  • Need to resolve via command line?
  • Avoid conflicts by always starting your work session with a git pull (and make sure you don’t have uncommitted changes)
  • Also communicate who is working on what

Interactivity 🎮

💬 Why use interactive visualizations?



What advantages/disadvantages do interactive visualizations have over static plots?

Advantages

  • Explore data dynamically
  • Zoom, pan, and hover for details
  • Reveal extra information without clutter
  • Enhance engagement in presentations and reports
  • Useful for exploratory analysis and dashboards

Disadvantages

  • Harder to include in static formats (papers, print)
  • Accessibility issues (screen readers, non colorblind-friendly defaults)
  • Can overwhelm the audience if overused
  • Larger file sizes, slower rendering
  • Requires additional packages/libraries (e.g., {plotly})

🧠 Interactivity enhances, but does not replace,
the fundamentals of good data visualization

Plotly



  • Free and open-source interactive graphing library
  • R users can access it via the {plotly} package: plotly.com/r

Plotly in R



Two main ways to create a plotly object in R:

  • Convert a ggplot2 object with ggplotly()
  • Directly initialize a plotly object with plot_ly()
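
A minimal sketch of both routes, using the built-in mtcars data as a stand-in (the course data is not available here). The object names are made up for illustration.

```r
library(ggplot2)
library(plotly)

# Route 1: build a static ggplot, then convert it
p_static <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point()
p_from_gg <- ggplotly(p_static)

# Route 2: build the plotly object directly; ~ marks column names
p_direct <- plot_ly(
  data = mtcars,
  x = ~hp,
  y = ~mpg,
  type = "scatter",
  mode = "markers"
)
```

Route 1 is usually the quickest when a ggplot already exists; route 2 gives more direct control over the interactive elements.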

➡️ ggplotly() example with Oefenweb data

Additional functionalities in Plotly



With plot_ly(), you can add and control interactive elements such as:

Requirements Assignment 1, part B (individual)

  • Make the static ggplot object interactive using the {plotly} package
  • Add a customized tooltip that displays additional information when you hover over the plot
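
One common way to build such a tooltip is to pass the text through an extra text aesthetic and tell ggplotly() to show only that. The sketch below assumes the built-in mtcars data instead of the assignment data, and the names cars_df, p_static, and p_interactive are hypothetical.

```r
library(ggplot2)
library(plotly)

cars_df <- mtcars
cars_df$model <- rownames(mtcars)

# ggplot2 does not draw the `text` aesthetic; ggplotly() can use it as the tooltip.
# <br> starts a new line inside the tooltip.
p_static <- ggplot(
  cars_df,
  aes(x = hp, y = mpg,
      text = paste0("Model: ", model,
                    "<br>Cylinders: ", cyl,
                    "<br>Weight: ", wt * 1000, " lbs"))
) +
  geom_point(color = "steelblue")

# tooltip = "text" shows only the custom text on hover
p_interactive <- ggplotly(p_static, tooltip = "text")
```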

Better plots 🥇

Tuesday, Oct 7

  • Feedback on assignment 1
  • Recap: what makes a good DV?
  • More data visualization features
  • Work on assignment 2

Tips from Assignment 1

  • Use coherent color palettes (more on that today!)
  • Use informative titles (for example, state your conclusion in the title or subtitle). Make them bold and clearly readable.
  • What is the added value of a boxplot or violin plot?
  • Think about sample size within your groupings and how to communicate its effect.
  • Go from exploratory to explanatory!
  • The audience should be able to infer the main conclusion without reading the description.
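
One possible way to communicate sample size is to print it in the axis labels and to show the raw points. This is a sketch on the built-in mtcars data; the names n_per_cyl, cyl_labels, and p_n are made up for illustration.

```r
library(ggplot2)

# Count observations per group (mtcars ships with base R)
n_per_cyl <- table(mtcars$cyl)
cyl_labels <- paste0(names(n_per_cyl), " cylinders\n(n = ", as.vector(n_per_cyl), ")")
names(cyl_labels) <- names(n_per_cyl)  # names match the axis breaks "4", "6", "8"

p_n <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(outlier.shape = NA) +
  # raw points make small groups visible at a glance
  geom_jitter(width = 0.1, height = 0, alpha = 0.6) +
  scale_x_discrete(labels = cyl_labels) +
  labs(x = NULL, y = "Miles per gallon")
```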

patchwork

  • Plots are combined with a +
  • Add a general title with plot_annotation(title = "...")
  • Add a general subtitle with plot_annotation(subtitle = "...")
  • Change the formatting with plot_annotation(theme = theme(plot.title = element_text(...)))
Code
library(ggplot2)
library(patchwork)
# Two example plots
p1 <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point(color = "steelblue", size = 3, alpha = 0.7) +
  theme_minimal() +
  labs(title = "Fuel efficiency vs Horsepower")

p2 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "steelblue", alpha = 0.7) +
  theme_minimal() +
  labs(title = "MPG by Number of Cylinders") +
  theme(legend.position = "none")

# Combine with patchwork and add title
combined <- p1 + p2 + 
  plot_annotation(
    title = "Analysis of mtcars dataset",
    subtitle = "Exploring relationships between engine features and fuel efficiency",
    caption = "Data source: mtcars",
    theme = theme(plot.title = element_text(face = "bold", size = 16),
                  plot.subtitle = element_text(size = 14))
  ) 

combined

One dataset, many visualizations

Code
library("dplyr")
library("gt")
data("pizzaplace")
pizza_top <- pizzaplace %>%
  mutate(size = factor(size, levels = c("S", "M", "L"))) %>%
  count(name, type, size, price, sort = TRUE) %>%
  slice_max(n, n = 5)  # five best-selling name/size combinations
pizza_top %>%
  gt() %>%
  tab_header(title = "Pizza Top 5", subtitle = "2015") %>%
  fmt_currency(columns = price, currency = "USD") %>%
  tab_source_note(source_note = md("Source: [pizzaplace dataset](https://gt.rstudio.com/articles/gt-datasets.html#pizzaplace)")) %>%
  opt_stylize(style = 6)
Pizza Top 5
2015

name         type     size  price   n
big_meat     classic  S     $12.00  1914
thai_ckn     chicken  L     $20.75  1410
five_cheese  veggie   L     $18.50  1409
four_cheese  veggie   L     $17.95  1316
classic_dlx  classic  M     $16.00  1181

Source: pizzaplace dataset

Code
library("ggplot2")
pizza_top %>%
  ggplot(aes(x = reorder(name, n, decreasing = TRUE), y = n)) +
  geom_point(aes(color = type, size = size)) +
  geom_text(aes(label = price), nudge_y = -30) +
  labs(title = "Pizza Top 5", subtitle = "2015", x = "name") 

One dataset, many visualizations

Code
library("dplyr")
library("gt")
library("tidyr")  # pivot_wider()
pizza_season <- pizzaplace %>%
  mutate(month = lubridate::month(date, label = TRUE)) %>%
  group_by(month) %>%
  count(type)
pizza_season %>%
  pivot_wider(names_from = month, values_from = n) %>%
  gt() %>%
  tab_header(title = "Pizza Season", subtitle = "2015") %>%
  tab_source_note(source_note = md("Source: [pizzaplace dataset](https://gt.rstudio.com/articles/gt-datasets.html#pizzaplace)")) %>%
  opt_stylize(style = 6)
Pizza Season
2015

type     Jan   Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
chicken   913   875   994   924   939   910   963   934   900   832   981   885
classic  1257  1178  1236  1253  1324  1199  1331  1283  1202  1181  1262  1182
supreme  1044   964   991  1013  1045  1040  1041   991   877   998  1050   933
veggie   1018   944  1040   961  1020   958  1057   960   911   872   973   935

Source: pizzaplace dataset

Code
library("ggplot2")
library("lubridate")
fig_season_1 <- pizza_season %>%
  ggplot(aes(x = month, y = n, group = type)) +
  geom_bar(aes(fill = type), stat = "identity") +
  labs(title = "Pizza Season", subtitle = "2015", y = "Number of pizzas sold", x = "Month")
fig_season_1

Code
fig_season_2 <- pizza_season %>%
  ggplot(aes(x = month, y = n, group = type)) +
  geom_line(aes(linetype = type)) +
  labs(title = "Pizza Season", subtitle = "2015", y = "Number of pizzas sold", x = "Month")
fig_season_2

from data to viz

Recap

✓ Keep it simple.

✓ Don’t mislead.

✓ Tell one story.

First impressions matter!

Chart junk

NYT

Claus Wilke

NYT

You can try it yourself with geom_image() from ggimage

Code
library(dplyr)
library(ggimage)
library(ggplot2)
library(tidyr)  # unnest()
# aggregate to totals
pizza_totals <- pizza_season %>%
  group_by(type) %>%
  summarise(total = sum(n), .groups = "drop") %>%
  mutate(n_icons = round(total / 1000))  # 1 icon = 1000 pizzas sold

# expand data: one row per pizza icon, so that each icon can be plotted separately
# (essentially geom_point(), but the row of icons reads as a bar)
chart_junk_data <- pizza_totals %>%
  rowwise() %>%
  mutate(icon_id = list(1:n_icons)) %>%
  unnest(icon_id) 

# plot with 🍕
pizza_plot <- ggplot(chart_junk_data, aes(x = icon_id, y = type)) +
  geom_image(aes(image = "images/pizza.png"), size = 0.2) +
  # add white rectangle to cover up excess icons (taken from pizza_totals data)
  geom_rect(
    data = pizza_totals,
    aes(xmin = total/1000, xmax = total/1000 + 2, 
        ymin = as.numeric(factor(type)) - 0.5, 
        ymax = as.numeric(factor(type)) + 0.5),
    inherit.aes = FALSE,
    fill = "white"
  ) +
  labs(
    title = "Total Pizza Sales in 2015",
    x = "Sales (Scale 1:1000)",
    y = NULL
  ) +
  theme_minimal(base_size = 14) +
  theme(
    panel.grid.minor.y = element_blank(),
    panel.grid.major.y = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.ticks = element_blank()
  )

pizza_plot

Themes

Code
library("cowplot")
fig_season_2 +
  theme_cowplot()

Code
my_theme <- theme_cowplot() +
  theme(
    panel.grid.major = element_line(color = "gray90"),
    axis.ticks = element_line(color = "gray20"),
    axis.text = element_text(color = "gray20", face = "italic", size = 16),
    axis.title = element_text(color = "gray20", face = "bold", size = 16),
    plot.title = element_text(hjust = 0.5, size = 20, face = "bold"),
    plot.subtitle = element_text(hjust = 0.5, size = 18),
    legend.position = "top",
    legend.title = element_text(size = 16, face = "bold"),
    legend.text = element_text(size = 16)
  )
fig_season_2 + my_theme

Facets

  • theme_cowplot() and theme_bw() format facets nicely.

  • Change facet format manually with theme(strip.text = element_text(...)) and theme(strip.background = element_rect(...)).

Code
fig_season_2 +
  facet_wrap(~type) +
  theme_bw(12) +
  theme(legend.position = "none")

Code
fig_season_2 +
  geom_line(linetype = "solid", color = "gray30", linewidth = 0.5) +
  facet_wrap(~type) +
  theme_minimal(12) +
  theme(legend.position = "none",
        strip.text = element_text(face = "bold", color = "plum4"),
        strip.background = element_rect(fill = "thistle2", color = NA)) # fill for background; color for border

Colors

Code
fig_quarter <- pizza_season %>%
  ungroup() %>%
  mutate(quarter = case_when(
    month %in% c("Jan", "Feb", "Mar") ~ "Q1",
    month %in% c("Apr", "May", "Jun") ~ "Q2",
    month %in% c("Jul", "Aug", "Sep") ~ "Q3",
    month %in% c("Oct", "Nov", "Dec") ~ "Q4"
  )) %>%
  # sum the monthly counts so that each bar shows the quarterly total
  # (otherwise the three monthly bars per quarter are drawn on top of each other)
  group_by(quarter, type) %>%
  summarise(n = sum(n), .groups = "drop") %>%
  ggplot(aes(x = quarter, y = n, group = type)) +
  geom_col(aes(fill = type), position = "dodge") +
  labs(title = "Pizza Season", subtitle = "2015, split by quarter", y = "Number of pizzas sold", x = "")

fig_quarter +
  scale_fill_viridis_d()

Color scales

Code
fig_quarter +
  labs(subtitle = "Qualitative Color Scale") +
  scale_fill_brewer(type = "qual")

qualitative
(categorical data)

Code
fig_quarter +
  labs(subtitle = "Sequential Color Scale") +
  scale_fill_brewer(type = "seq")

sequential
(ordered data that progress from low to high)

Code
fig_quarter +
  labs(subtitle = "Diverging Color Scale") +
  scale_fill_brewer(type = "div")

diverging
(ordered data that progress from low to high with a critical midpoint, e.g., 0)
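
The pizza types above are categorical, so a diverging scale is shown there only for its look. As a hedged sketch of data that does have a critical midpoint, the example below uses the built-in airquality data and ggplot2's scale_fill_gradient2(); the object names monthly and p_div are made up for illustration.

```r
library(ggplot2)

# airquality ships with base R: daily temperatures (°F) in New York, May to September 1973
monthly <- aggregate(Temp ~ Month, data = airquality, FUN = mean)
monthly$deviation <- monthly$Temp - mean(monthly$Temp)  # negative = cooler than average

p_div <- ggplot(monthly, aes(x = factor(Month), y = deviation, fill = deviation)) +
  geom_col() +
  # two hues that meet in a neutral color at the midpoint 0
  scale_fill_gradient2(low = "steelblue", mid = "gray90", high = "tomato3",
                       midpoint = 0) +
  labs(x = "Month", y = "Deviation from the average monthly temperature (°F)",
       fill = "Deviation")
```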

Color blindness

Code
# remotes::install_github("clauswilke/colorblindr")
library("colorblindr")
colorblindr::cvd_grid(fig_quarter)

The package MetBrewer has many colorblind-friendly palettes:

Code
library("MetBrewer")
MetBrewer::colorblind_palettes
 [1] "Archambault" "Cassatt1"    "Cassatt2"    "Demuth"      "Derain"     
 [6] "Egypt"       "Greek"       "Hiroshige"   "Hokusai2"    "Hokusai3"   
[11] "Ingres"      "Isfahan1"    "Isfahan2"    "Java"        "Johnson"    
[16] "Kandinsky"   "Morgenstern" "OKeeffe1"    "OKeeffe2"    "Pillement"  
[21] "Tam"         "Troy"        "VanGogh3"    "Veronese"   
Code
fig_quarter +
  scale_fill_manual(values = MetBrewer::met.brewer("VanGogh3", n = 4))

Color accuracy

Print-proof, monitor/beamer-proof, colorblind-proof?

Source: benq.com

Captions

  • Title: descriptive or declarative
  • Methods: keep it brief
  • Results: if not (fully) captured in the title
  • Definitions: colors, line types, error bars, etc.
  • Data source: if external
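
In ggplot2, these parts can be attached to the figure itself with labs(). A minimal sketch on the built-in mtcars data, with wording that is only an example:

```r
library(ggplot2)

p_caption <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(
    title = "Heavier cars travel fewer miles per gallon",    # declarative title
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    caption = paste(
      "Line: linear fit; band: 95% confidence interval.",    # definitions
      "Data source: mtcars (1974 Motor Trend US magazine)."  # data source
    )
  )
```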

Source: sketch.es

Typography

Change fonts with the {showtext} package and the font_add_google() function. Browse Google Fonts for available font families.

Code
library("showtext")
sysfonts::font_add_google("Lekton")
showtext::showtext_auto()

fig_season_2 +
  theme_cowplot() +
  theme(text = element_text(family = "Lekton", size = 20))

Consider your audience

Code
pizza_plot

Consider your audience

Code
sysfonts::font_add_google("Lobster")
sysfonts::font_add_google("Lexend")
showtext::showtext_auto()

pizza_plot +
  # add a geom_rect element with opacity on top of pizza bars, to highlight the biggest bar
  geom_rect(
    data = pizza_totals %>%
      mutate(highlight = type == "classic"),
    aes(xmin = 0, xmax = total/1000, 
        ymin = as.numeric(factor(type)) - 0.5, 
        ymax = as.numeric(factor(type)) + 0.5,
        alpha = highlight),
    inherit.aes = FALSE,
    fill = "white"
  ) +
  scale_alpha_manual(values = c("TRUE" = 0, "FALSE" = 0.3), guide = "none") +
  labs(title = "The classic is a classic for a reason.",
       subtitle = "Total Pizza Sales in 2015",
       caption = "Source: pizzaplace dataset") +
  # change fonts
  theme(text = element_text(family = "Lexend", color = "gray30"),
        plot.title = element_text(family = "Lobster", size = 20, color = "tomato3"),
        plot.subtitle = element_text(family = "Lexend", size = 16),
        axis.title.x = element_text(family = "Lobster", size = 16),
        axis.text.y = element_text(family = "Lobster", size = 16, color = "tomato3"))

Consider your audience

Code
max_sales <- max(pizza_totals$total)
best_seller <- pizza_totals$type[pizza_totals$total == max_sales]

plot_data <- pizza_totals %>%
  # Create a logical column to highlight the best seller
  mutate(is_max = total == max_sales) %>%
  # Order the bars by total sales (optional, but professional practice)
  mutate(type = forcats::fct_reorder(type, total, .desc = TRUE))

# Define the subtitle
main_conclusion <- paste0(
  "The '", best_seller, "' pizza is the clear leader with ", 
  scales::comma(max_sales), " units sold."
)

pizza_plot2 <- ggplot(plot_data, aes(x = type, y = total, fill = is_max)) +
  geom_col(width = 0.7) +
  # Add labels above the bars
  geom_text(
    aes(label = scales::comma(total)),
    vjust = -0.5, # Position the text slightly above the bar
    size = 4, 
    fontface = "bold"
  ) +
  
  # Apply manual colors: Highlight color for TRUE, Neutral gray for FALSE
  scale_fill_manual(
    values = c("TRUE" = "tomato3", "FALSE" = "grey70"), # Red for highlight
    guide = "none" # Remove the legend for the fill color
  ) +
  
  # Customize Titles and Labels
  labs(
    title = "Total Sales by Pizza Type",
    subtitle = main_conclusion,
    x = "Pizza Type",
    y = "Total Units Sold"
  ) +
  
  # Apply a clean cowplot theme
  theme_cowplot() +
  
  # Remove y-axis clutter
  theme(
    # Align and style title/subtitle
    plot.title.position = "plot",
    plot.title = element_text(size = 16, face = "bold"),
    plot.subtitle = element_text(size = 14, color = "grey30"),
    
    # Hide the y-axis line, ticks, and label as the data is already on the bars
    axis.line.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.text.y = element_blank(),
    axis.title.y = element_blank(),
    
    # Add faint horizontal grid lines (theme_cowplot() has none by default)
    panel.grid.major.y = element_line(color = "grey90", linetype = "dashed")
  ) +
  
  # Ensure there is enough room for the labels on top
  scale_y_continuous(expand = expansion(mult = c(0, 0.15)))

# Print the final plot
pizza_plot2

File format/size

  • File size: email attachment, webpage/image load time, compilation time
  • File format: vector images resize without loss, bitmap/raster images do not. For bitmap images, set the plot resolution with dpi = "retina", "print", or "screen" (or a number).
ggplot2::ggsave("awesome_plot.png",
                width = 5,
                height = 5,
                units = "cm",
                dpi = "retina")
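
To see the trade-off between format and file size, the same plot can be saved in several ways and the sizes compared. This is a sketch that writes to a temporary folder; the file names and the 12 by 8 cm size are arbitrary choices.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()
out_dir <- tempdir()  # temporary folder, so the sketch leaves no files in your project

# Vector format: resizes without loss
pdf_file <- file.path(out_dir, "awesome_plot.pdf")
ggsave(pdf_file, plot = p, width = 12, height = 8, units = "cm")

# Raster format at two resolutions: more dots per inch means more pixels to store
png_screen <- file.path(out_dir, "awesome_plot_screen.png")
png_print  <- file.path(out_dir, "awesome_plot_print.png")
ggsave(png_screen, plot = p, width = 12, height = 8, units = "cm", dpi = "screen")  # 72 dpi
ggsave(png_print,  plot = p, width = 12, height = 8, units = "cm", dpi = "print")   # 300 dpi

# File sizes in bytes, in the order pdf, screen png, print png
file.size(c(pdf_file, png_screen, png_print))
```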

Source: clauswilke.com

Font embedding

Vector images fall back to the closest available font if the actual font is not installed on the recipient's computer. To prevent this, you can embed fonts into the vector image.

Adobe Acrobat (paid version) can be used to manually embed fonts in a PDF.
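
A free alternative that is often suggested is to write the PDF with R's cairo_pdf() device, which embeds the fonts it uses. The sketch below assumes an R build with cairo support and checks for it first; the file name is made up, and it is worth verifying the result in your PDF viewer's document properties.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = hp, y = mpg)) +
  geom_point() +
  theme(text = element_text(family = "serif"))

pdf_file <- file.path(tempdir(), "embedded_fonts.pdf")

if (capabilities("cairo")) {
  # Pass the device function itself to ggsave()
  ggsave(pdf_file, plot = p, device = cairo_pdf, width = 12, height = 8, units = "cm")
} else {
  message("This R build has no cairo support, so no file is written in this sketch.")
}
```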

Continue learning

Get inspiration

❗️ A note on Friday’s class

  • Hand in your first draft before Friday 12.00.
  • Peer review is done in class.
  • You will get one-on-one feedback from us; this is your last chance!

Attendance is expected!

Q&A

Final report formatting

  • You do not need to display all your code
  • Hide code in a single chunk with echo = FALSE
  • or fold all code in your YAML header (R Markdown):
---
title: "example rmd"
author: "Annie Johansson"
date: "`r Sys.Date()`"
output: 
  html_document:
    code_folding: hide
---
  • or in your code chunk (Quarto):
#| echo: true
#| code-fold: true

# your code here 
Code
# your code here 
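
Another option is to set the default for all chunks once, in a setup chunk at the top of the document. A minimal sketch, assuming the knitr package (which runs the R chunks in both R Markdown and Quarto); which outputs to hide is up to you.

```r
# Place in a setup chunk at the top of the .Rmd or .qmd file
knitr::opts_chunk$set(
  echo = FALSE,     # do not show code
  message = FALSE,  # do not show package start-up messages
  warning = FALSE   # do not show warnings
)
```

Individual chunks can still override these defaults, for example with echo = TRUE.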

Friday, Oct 10

🔗 Peer Review on Canvas

Grading

  • Assignment 1: 33% of your final grade
  • Assignment 2: 66% of your final grade
  • The final data viz grade counts for 30% of your final BDS Toolbox grade and needs to be ≥ 5.5