Introduction

A really valuable exercise in learning about data visualization with ggplot is finding high quality graphics found in popular newspapers or research agencies and attempting to recreate them from scratch. I was inspired by Dr. Patrick Schloss at the University of Michigan doing the same on his Riffomonas Project YouTube channel, and as my second blog post, wanted to contribute my own example. I hope to give those new to ggplot the confidence to create publication-worthy graphics, and what better way to start than by mimicking the experts from the newspaper.

I decided to replicate a graph from a New York Times article on the decline of math and reading scores in the United States, among 8th and 4th graders, respectively. Below I display the graphic, produced by Francesca Paris using data from the National Assessment of Educational Progress (NAEP).

The structure of this document is as follows. I’m going to start with a basic attempt, and then in several steps, ramp up the complexity until we get a plot that is nearly identical to the original graphic. At each step, I’m going to show you the code, with comments highlighting the main additions relative to the prior version. If you just want the entire code in one place and without my comments, skip to the very end. You can also click the “Show All Code” or “Hide All Code” button in the top right of this page if you have a preference.

Getting Started

I have obtained the data from the NAEP, but it wasn’t in a tidy form. I did some data cleaning and saved it as a clean .csv file you can access from this URL or by downloading it from my GitHub. Since this post isn’t about data wrangling but data visualization, I will spare you the details.

Proceeding to our first visual, we see that our tibble has columns Year, percentile (10, 25, 50, 75, or 90), score, and subject. We will use the geom_line() and geom_point() geometries, with aesthetics x = Year, y = score, and group = percentile, and lastly we facet by subject (i.e., a separate plot per value of subject). Well, let’s pause for a moment. We could facet by subject, but I realized later when creating this post that we lose the ability to easily set $y$-limits for each faceted panel separately. Instead, we will create two separate plot and put them together using the cowplot package. For now, let us just focus on the subject = math plot, since the code is the same. While we’re loading in packages, I’m going to import the Libre Franklin font from Google Fonts as that resembles the font used by the New York Times. We’ll actually change the fonts at the end.

Code

library(tidyverse)
library(ggtext) # Formats markdown in plot text
library(glue) # String interpolation
library(showtext) # Changing plot font
library(cowplot) # Combining ggplots in a grid

font_add_google(name = 'Libre Franklin', family = 'franklin')

# Access on Github
filename = 'national_math_8th_reading_4th_scores_1990_to_2024_by_percentile.csv'

df = read_csv(filename) |> 
  filter(Year >= 2000) |> 
  # focus on math for now
  filter(subject == 'math')

plot = df |> 
  ggplot(aes(x = Year, 
             y = score,
             group = percentile # one line per percentile
  )) + 
  geom_point() + 
  geom_line() 

plot

Already, we have made a great start! Let’s take care of some low hanging fruit. We don’t need the $x$-axis label, and the $y$-axis label information will be present in the title. The title will be formatted with markdown using the ggtext package.

Let’s reduce the unnecessary theme elements by adding theme_minimal(), including the grey background. Then we will remove the vertical grid-lines and the minor horizontal ones using the panel.grid.* arguments to theme(). We will also add back tick marks which were removed by theme_minimal().

Code

plot = df |> 
  ggplot(aes(x = Year, 
             y = score, 
             group = percentile)) + 
  geom_point() + 
  geom_line() + 
  # Unnecessary theme-ing removed
  theme_minimal() +
  labs(x = NULL, 
       y = NULL,
       title = "**Math** scores for **8th graders**") +
  theme(# format facet labels with markdown
        plot.title = element_markdown(),  
        # Now remove a bunch of grid-lines
        panel.grid.major.x = element_blank(), 
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        # add back x axis ticks and change their length
        axis.ticks.x.bottom = element_line(),
        axis.ticks.length.x.bottom = unit(0.2, "cm"))

plot

Cleaning up the background, grid lines, and ticks.

Next we need to add some color. The author chose to place the lowest and highest deciles in a different color than the rest. Thus, we will create a binary indicator column reflecting this grouping, prior to our ggplot. After that, we can add a color aesthetic according to this grouping. Using the ‘Eyedropper’ tool in Mozilla Firefox, I determined the author used colors ‘#b35f57’ and ‘#aaaaaa’. We will use scale_color_manual() to implement this new color scale. The legend is unnecessary, so we remove it using guides().

Code

plot = df |> 
  mutate(is_extreme_score = 
           ifelse(percentile == '10' | percentile == '90',
                  "yes", "no")) |> 
  ggplot(aes(x = Year,
             y = score, 
             group = percentile,
             color = is_extreme_score)) + 
  geom_point() + 
  geom_line() + 
  theme_minimal() +
  labs(x = NULL, 
       y = NULL,
       title = "**Math** scores for **8th graders**") +
  theme(plot.title = element_markdown(),  
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.ticks.x.bottom = element_line(),
        axis.ticks.length.x.bottom = unit(0.2, "cm")) +
  scale_color_manual(
    # Specify breaks explicitly so we get colors in the right order
    breaks = c('yes', 'no'), 
    values = c('#b35f57', '#aaaaaa')) +
  guides(color = 'none') # unnecessary legend

plot

We now add the labels. This will require creating another column pretty_label in our dataset with the label. This will require some case-work with the case_when() function. Since we only want a label by the 2024 dot, we need to set the label as NA for all other years. The top and bottom percentiles have their own unique label, and for the middle percentiles, we can do some string interpolation using the glue package. After that, we can call the geom_richtext() function, which is an extension of geom_text() from the ggtext package to allow for Markdown formatting. In particular, we bold the extremes and add a line break with  . We also want the labels to the right, which we set using hjust.

Code

plot = df |> 
  mutate(is_extreme_score = ifelse(percentile == '10' | percentile == '90',
                                   "yes", "no"),
         pretty_label = case_when(
           percentile == 90 & Year == 2024 ~ "**Top<br>scorers**",
           percentile == 10 & Year == 2024 ~ "**Lowest<br> scorers**",
           (percentile >= 25 | percentile <= 90) & Year == 2024 ~ 
             glue("{percentile}th<br>percentile"),
           .default = NA_character_
         )) |> 
  ggplot(aes(x = Year,
             y = score, 
             group = percentile,
             color = is_extreme_score)) + 
  geom_point() + 
  geom_line() + 
  geom_richtext(aes(label = pretty_label),
                fill = NA, # text box should be transparent
                na.rm = TRUE, 
                label.color = NA, # Remove the box outline
                hjust = 0 # re-position text to right
                ) +
  theme_minimal() +
  labs(x = NULL, 
       y = NULL,
       title = "**Math** scores for **8th graders**") +
  theme(plot.title = element_markdown(),  
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.ticks.x.bottom = element_line(),
        axis.ticks.length.x.bottom = unit(0.2, "cm")) +
  scale_color_manual(breaks = c('yes', 'no'),
                     values = c('#b35f57', '#aaaaaa')) +
  guides(color = 'none') 

plot

We need to fix the limits, margins, and breaks of the axes. For example, we want to prevent plot items like the text from being clipped by the panel margins, and instead be clipped by the plot boundaries itself.To do so, set clip = F with the coord_cartesian() function. While we’re at it, we can set expand = F to avoid the unnecessary expansion ggplot adds by default to the scale. Instead, we manually set the $x$ and $y$ limits with the scale_*_continuous() functions. It was at this point I realized why facet_wrap would fail: I needed to manually set the axes limits for each faceted plot separately, since math and reading are on different scales. The solution is to just make two separate ggplot objects and combine with the cowplot package later.

In addition to changing the limits, we can control the exact axis ticks that appear and their labels using the breaks and label argument of the scale_x_continuous() function. This requires some tedious relabeling of the axes.

Even after this, we still need more space though. So we will directly set the plot.margin argument to give more space to the right.

Code

plot = df |> 
  mutate(is_extreme_score = ifelse(percentile == '10' | percentile == '90',
                                   "yes", "no"),
         pretty_label = case_when(
           percentile == 90 & Year == 2024 ~ "**Top<br>scorers**",
           percentile == 10 & Year == 2024 ~ "**Lowest<br> scorers**",
           (percentile >= 25 | percentile <= 90) & Year == 2024 ~ 
             glue("{percentile}th<br>percentile"),
           .default = NA_character_
         )) |> 
  ggplot(aes(x = Year,
             y = score, 
             group = percentile,
             color = is_extreme_score)) + 
  geom_point() + 
  geom_line() + 
  geom_richtext(aes(label = pretty_label),
                fill = NA, na.rm = TRUE, label.color = NA,
                hjust = 0) +
  theme_minimal() +
  labs(x = NULL, 
       y = NULL,
       title = "**Math** scores for **8th graders**") +
  theme(plot.title = element_markdown(),  
        panel.grid.major.x = element_blank(),
        panel.grid.minor.x = element_blank(),
        panel.grid.minor.y = element_blank(),
        axis.ticks.x.bottom = element_line(),
        axis.ticks.length.x.bottom = unit(0.2, "cm"),
        plot.margin = margin(0.5,2,0.5,0.5, "cm") # Add space on right of plot
        ) +
  scale_color_manual(breaks = c('yes', 'no'),
                     values = c('#b35f57', '#aaaaaa')) +
  guides(color = 'none') +
  coord_cartesian(expand = F, clip = 'off') +
  scale_y_continuous(limits = c(210, 340), # Where to start and end the y axis
                     breaks = seq(220, 320, by = 20) # Where to put tick marks
                     # Note, no labels argument necessary because we literally
                     # want to show the integer as the label
  ) +
  scale_x_continuous(
    # where to start and end the x axis
    limits = c(2000, 2024), 
    # where the tick marks belong
    breaks = c(2000, 2003, 2007, 2011, 2015, 2019, 2024),
    # what to label the tick marks that we picked using breaks
    labels = c("'00", "'03", "'07", "'11", "'15","'19", "'24")
  )

plot

Fixing axis limits, clipping, and axis labels.

Handling two plots at a time

This looks close to completion for the math version. Let’s now write a function plot_generator() that makes the math or reading version of the plot. The function will take in the original tibble with both subjects included as well as the desired subject and output the specified plot. The function first does the necessary filtering and creation of new columns. The filtering part is a bit complicated, since we want to insert the subject argument into a dplyr function. This requires so called ‘tidy evaluation’ using the !! injection operator function from the rlang package. After subsetting and mutating the tibble, the function returns the plot.

The only components of the plot that are unique to each subject are the arguments to scale_y_continuous() and the title, so we’ll define those conditional on the subject. After that, we can generate the combined plot by calling the function plot_generator() twice and then using the simple cowplot library syntax. We can add in the caption using the plot_annotation() function.

Code

plot_generator = function(df, subject) {
  if (subject == 'math') {
    plot_title = "**Math** scores for **8th graders**"
    y_limits = c(210, 340)
    y_breaks = seq(220, 320, by = 20)
  }
  else if (subject == 'reading') {
    plot_title = "**Reading** scores for **4th graders**"
    y_limits = c(150, 270)
    y_breaks = seq(160, 260, by = 20)
  }
  else(
    stop("Pass in either 'math' or 'reading' as an argument for subject.")
  )
  
  df_subset = df |> filter(Year >= 2000) |>
    # Tricky note: Use rlang syntax for tidy evaluation  
    filter(subject == !!subject) |> 
    mutate(
      is_extreme_score = ifelse(percentile == '10' | percentile == '90', "yes", "no"),
      pretty_label = case_when(
        percentile == 90 & Year == 2024 ~ "**Top<br>scorers**",
        percentile == 10 &
          Year == 2024 ~ "**Lowest<br> scorers**",
        (percentile >= 25 | percentile <= 90) & Year == 2024 ~
          glue("{percentile}th<br>percentile"),
        .default = NA_character_
      )
    ) 
  
  plot = df_subset |> ggplot(aes(
    x = Year,
    y = score,
    group = percentile,
    color = is_extreme_score
  )) +
    geom_point() +
    geom_line() +
    geom_richtext(
      aes(label = pretty_label),
      fill = NA,
      na.rm = TRUE,
      label.color = NA,
      hjust = 0
    ) +
    theme_minimal() +
    labs(x = NULL, y = NULL, title = plot_title) +
    theme(
      plot.title = element_markdown(),
      panel.grid.major.x = element_blank(),
      panel.grid.minor.x = element_blank(),
      panel.grid.minor.y = element_blank(),
      axis.ticks.x.bottom = element_line(),
      axis.ticks.length.x.bottom = unit(0.2, "cm"),
      plot.margin = margin(0.5, 2, 0.5, 0.5, "cm")
    ) +
    scale_color_manual(breaks = c('yes', 'no'),
                       values = c('#b35f57', '#aaaaaa')) +
    guides(color = 'none') +
    coord_cartesian(expand = F, clip = 'off') +
    scale_y_continuous(limits = y_limits, breaks = y_breaks) +
    scale_x_continuous(
      limits = c(2000, 2024),
      breaks = c(2000, 2003, 2007, 2011, 2015, 2019, 2024),
      labels = c("'00", "'03", "'07", "'11", "'15", "'19", "'24")
    )
  
  return(plot)
}

Now let’s show a workflow to generate the plot. The placement of the individual plots in the cowplot can be modified using the draw_plot() arguments, which use relative scaling values (e.g., 0.15 means 15%). We can also add the caption with draw_label(). The caption also has a grey dot that I couldn’t replicate since the draw_label() function doesn’t support HTML customization (it is basically a wrapper for geom_label(), not something from the ggtext package). For similar reasons, I could not add the hyperlinks.

Code

# Access on GitHub (see beginning of post for link)
filename = 'national_math_8th_reading_4th_scores_1990_to_2024_by_percentile.csv'
df = read_csv(filename)

plot_math = plot_generator(df, 'math')
plot_reading = plot_generator(df, 'reading')

caption_text <- paste(
  "Top scorers shown are at the 90th percentile; lowest scorers are at the 10th.",
  "Scores are from the National Assessment of Educational Progress,",
  "which tests a national sample of students to track educational achievement.",
  "Source: NAEP. By Francesca Paris. Recreated by Akshay Prasadan.",
  sep = "\n"
)

# Combine with cowplot
plot <- ggdraw() +
  draw_plot(plot_math, x = 0, y = 0.10, width = 0.5, height = 0.85) +
  draw_plot(plot_reading, x = 0.5, y = 0.10, width = 0.5, height = 0.85) +
  draw_label(caption_text,
             x = 0.03, y = 0.016, hjust = 0, vjust = 0,
             size = 9, lineheight = 1.2,
             fontfamily = 'franklin',
             fontface = "plain", color = 'grey40')

plot

Wow! That’s nearly perfect! It’s time for the final batch of editing. I find it easiest to first fix a size, and then save your plot using ggsave with those precise dimensions. Then, I fine-tune the margins or font sizes until it looks appropriate for that fixed dimension. If you rely on RStudio’s plotting window, then the sizes will vary depending on your zoom level or the size of the window on your monitor. This is not reproducible.

Recall I imported the Libre Franklin font, which is an approximation of the NYT’s font for graphics. Now I’m going to actually apply that font. After that I will make several minor sizing tweaks. This part isn’t very interesting, but I’ll comment my main changes.

Code

showtext_opts(dpi = 300)
showtext_auto()

plot_generator_final = function(df, subject) {
  if (subject == 'math') {
    plot_title = "**Math** scores for **8th graders**"
    y_limits = c(210, 340)
    y_breaks = seq(220, 320, by = 20)
  }
  else if (subject == 'reading') {
    plot_title = "**Reading** scores for **4th graders**"
    y_limits = c(150, 270)
    y_breaks = seq(160, 260, by = 20)
  }
  else(
    stop("Pass in either 'math' or 'reading' as an argument for subject.")
  )
  
  df_subset = df |> filter(Year >= 2000) |>
    filter(subject == !!subject) |> 
    mutate(
      is_extreme_score = ifelse(percentile == '10' | 
                                  percentile == '90', "yes", "no"),
      pretty_label = case_when(
        percentile == 90 & Year == 2024 ~ "**Top<br>scorers**",
        percentile == 10 &
          Year == 2024 ~ "**Lowest<br> scorers**",
        (percentile >= 25 | percentile <= 90) & Year == 2024 ~
          glue("{percentile}th<br>percentile"),
        .default = NA_character_
      )
    ) 
  
  plot = df_subset |> ggplot(aes(
    x = Year,
    y = score,
    group = percentile,
    color = is_extreme_score
  )) +
    geom_point() +
    geom_line() +
    geom_richtext(
      aes(label = pretty_label),
      fill = NA,
      na.rm = TRUE,
      label.color = NA,
      hjust = 0,
      size = 4, # Make label font size close to title font (units are weird)
      lineheight = 0.75 # Reduce line spacing of labels
    ) +
    theme_minimal() +
    labs(x = NULL, 
         y = NULL, 
         title = plot_title) +
    theme(
      text = element_text(family = 'franklin'),
      plot.title = element_textbox_simple(size = 12, 
                                          width= NULL, 
                                          # Prevent line break in titles
                                          padding = margin(0,0,10,0)
      ),
      panel.grid.major.x = element_blank(),
      panel.grid.minor.x = element_blank(),
      panel.grid.minor.y = element_blank(),
      axis.ticks.x.bottom = element_line(),
      axis.ticks.length.x.bottom = unit(0.2, "cm"),
      plot.margin = margin(0, 2, 0.5, 0.5, "cm"),
      # Refine axis tick label sizes, i.e., the "'00", "'03", etc.
      axis.text.x = element_text(size = 8, family = 'franklin'),
      axis.text.y = element_text(size = 8, hjust = 1,
                                 margin = margin(0, 6, 0, 0)),
      # Make sure plot title starts at plot edge (left-most boundary), 
      # not panel edge (panel = subset of plot)
      plot.title.position = 'plot'
    ) +
    scale_color_manual(breaks = c('yes', 'no'),
                       values = c('#b35f57', '#aaaaaa')) +
    guides(color = 'none') +
    coord_cartesian(expand = F, clip = 'off') +
    scale_y_continuous(limits = y_limits, breaks = y_breaks) +
    scale_x_continuous(
      limits = c(2000, 2024),
      breaks = c(2000, 2003, 2007, 2011, 2015, 2019, 2024),
      labels = c("'00", "'03", "'07", "'11", "'15", "'19", "'24")
    )
  
  return(plot)
}


plot_math_final = plot_generator_final(df, 'math')
plot_reading_final = plot_generator_final(df, 'reading')

plot_final <- ggdraw() +
  draw_plot(plot_math_final, x = 0, y = 0.12, width = 0.5, height = 0.85) +
  draw_plot(plot_reading_final, x = 0.5, y = 0.12, width = 0.5, height = 0.85) +
  draw_label(caption_text,
             x = 0.03, y = 0.015, hjust = 0, vjust = 0,
             size = 9, lineheight = 1.2,
             fontfamily = 'franklin', fontface = 'plain', color = 'grey40')

# Save
ggsave("replicated_nyt.png", plot_final, width = 6.6, height = 6.1, dpi = 300, bg = 'white')

Looks pretty good, right? Let me repost below the original.

Final Remarks

A careful inspection reveals many additional deficiencies in my replication attempt. For example, it seems the author actually used alpha as an aesthetic for the column we called is_extreme_score, since the percentiles seem to change transparency for the 25th, 50th, and 75th levels. The fonts seem a bit thicker too for the extremes. It would probably be easiest to just manually annotate the plot for finer font control. At this point, I feel like I’m giving up reproducibility, D.R.Y. code, and other programming tenets, but haven’t I turned from a programmer into a graphic designer at this point anyway?

Moreover, I still do not fully understand the best way to fine-tune margins of the various plot objects, since practically every theme argument has its own margin argument. So the spacing certainly doesn’t match the original. At a certain point, I basically gave up out of frustration.

But this is an exercise in learning ggplot, not patience. My central goal was to prove that I, and more importantly, you, the reader, can build high quality graphics that belong in leading newspaper publications. Granted, all the credit goes to the New York Times and Francesca Paris for their original design. I look forward to some future exercises like this, perhaps with The Economist, Bloomberg, or Financial Times instead.

The complete code from start to finish

Code

library(tidyverse)
library(ggtext) 
library(glue)
library(showtext) 
library(cowplot) 

font_add_google(name = 'Libre Franklin', family = 'franklin')
showtext_opts(dpi = 300)
showtext_auto()

plot_generator_final = function(df, subject) {
  if (subject == 'math') {
    plot_title = "**Math** scores for **8th graders**"
    y_limits = c(210, 340)
    y_breaks = seq(220, 320, by = 20)
  }
  else if (subject == 'reading') {
    plot_title = "**Reading** scores for **4th graders**"
    y_limits = c(150, 270)
    y_breaks = seq(160, 260, by = 20)
  }
  else(
    stop("Pass in either 'math' or 'reading' as an argument for subject.")
  )
  
  df_subset = df |> filter(Year >= 2000) |>
    filter(subject == !!subject) |> 
    mutate(
      is_extreme_score = ifelse(percentile == '10' | 
                                  percentile == '90', "yes", "no"),
      pretty_label = case_when(
        percentile == 90 & Year == 2024 ~ "**Top<br>scorers**",
        percentile == 10 &
          Year == 2024 ~ "**Lowest<br> scorers**",
        (percentile >= 25 | percentile <= 90) & Year == 2024 ~
          glue("{percentile}th<br>percentile"),
        .default = NA_character_
      )
    ) 
  
  plot = df_subset |> ggplot(aes(
    x = Year,
    y = score,
    group = percentile,
    color = is_extreme_score
  )) +
    geom_point() +
    geom_line() +
    geom_richtext(
      aes(label = pretty_label),
      fill = NA,
      na.rm = TRUE,
      label.color = NA,
      hjust = 0,
      size = 4, 
      lineheight = 0.75 
    ) +
    theme_minimal() +
    labs(x = NULL, 
         y = NULL, 
         title = plot_title) +
    theme(
      text = element_text(family = 'franklin'),
      plot.title = element_textbox_simple(size = 12, 
                                          width = NULL, 
                                          padding = margin(0,0,10,0)
      ),
      panel.grid.major.x = element_blank(),
      panel.grid.minor.x = element_blank(),
      panel.grid.minor.y = element_blank(),
      axis.ticks.x.bottom = element_line(),
      axis.ticks.length.x.bottom = unit(0.2, "cm"),
      plot.margin = margin(0, 2, 0.5, 0.5, "cm"),
      axis.text.x = element_text(size = 8, family = 'franklin'),
      axis.text.y = element_text(size = 8, hjust = 1,
                                 margin = margin(0, 6, 0, 0)),
      plot.title.position = 'plot'
    ) +
    scale_color_manual(breaks = c('yes', 'no'),
                       values = c('#b35f57', '#aaaaaa')) +
    guides(color = 'none') +
    coord_cartesian(expand = F, clip = 'off') +
    scale_y_continuous(limits = y_limits, breaks = y_breaks) +
    scale_x_continuous(
      limits = c(2000, 2024),
      breaks = c(2000, 2003, 2007, 2011, 2015, 2019, 2024),
      labels = c("'00", "'03", "'07", "'11", "'15", "'19", "'24")
    )
  
  return(plot)
}

filename = 'national_math_8th_reading_4th_scores_1990_to_2024_by_percentile.csv'
df = read_csv(filename) 

caption_text <- paste(
  "Top scorers shown are at the 90th percentile; lowest scorers are at the 10th.",
  "Scores are from the National Assessment of Educational Progress,",
  "which tests a national sample of students to track educational achievement.",
  "Source: NAEP. By Francesca Paris. Recreated by Akshay Prasadan.",
  sep = "\n"
)

plot_math_final = plot_generator_final(df, 'math')
plot_reading_final = plot_generator_final(df, 'reading')

plot <- ggdraw() +
  draw_plot(plot_math, x = 0, y = 0.10, width = 0.5, height = 0.85) +
  draw_plot(plot_reading, x = 0.5, y = 0.10, width = 0.5, height = 0.85) +
  draw_label(caption_text,
             x = 0.03, y = 0.016, hjust = 0, vjust = 0,
             size = 9, lineheight = 1.2,
             fontfamily = 'franklin',
             fontface = "plain", color = 'grey40')

ggsave("replicated_nyt.png", plot_final, width = 6.6, height = 6.1, dpi = 300, bg = 'white')

Citation

For attribution, please cite this work as:

Prasadan, Akshay. 2025. “Let’s Get in the Newspaper.” May 26, 2025. https://akprasadan.github.io/posts/2025-05-26-pandemic-math-replication/.

--- title: "Let's Get in the Newspaper" description: "Learn from the best by replicating a New York Times graphic on America's educational decline." author: - name: Akshay Prasadan url: https://akprasadan.github.io/ affiliation: PhD in Statistics & Data Science from Carnegie Mellon University date: 5-26-2025 categories: [R, dplyr, ggplot] # self-defined categories citation: url: https://akprasadan.github.io/posts/2025-05-26-pandemic-math-replication/ appendix-cite-as: display image: replicated_nyt.png format: html: anchor-sections: true toc: true code-fold: true code-tools: true knitr: opts_chunk: message: false warning: false --- # Introduction A really valuable exercise in learning about data visualization with `ggplot` is finding high quality graphics found in popular newspapers or research agencies and attempting to recreate them from scratch. I was inspired by Dr. Patrick Schloss at the University of Michigan doing the same on his [Riffomonas Project YouTube channel](https://www.youtube.com/c/RiffomonasProject), and as my second blog post, wanted to contribute my own example. I hope to give those new to `ggplot` the confidence to create publication-worthy graphics, and what better way to start than by mimicking the experts from the newspaper. I decided to replicate a graph from a [New York Times article](https://www.nytimes.com/2025/04/07/us/low-performing-students-reasons.html) on the decline of math and reading scores in the United States, among 8th and 4th graders, respectively. Below I display the graphic, produced by [Francesca Paris](https://www.nytimes.com/by/francesca-paris) using data from the [National Assessment of Educational Progress](https://www.nationsreportcard.gov/ndecore/landing) (NAEP). ![NYT Graphic](nyt_graphic.png){width=6.6in} The structure of this document is as follows. I'm going to start with a basic attempt, and then in several steps, ramp up the complexity until we get a plot that is nearly identical to the original graphic. At each step, I'm going to show you the code, with comments highlighting the main additions relative to the prior version. If you just want the entire code in one place and without my comments, skip to the very end. You can also click the "Show All Code" or "Hide All Code" button in the top right of this page if you have a preference. # Getting Started I have obtained the data from the NAEP, but it wasn't in a [tidy form](https://r4ds.hadley.nz/data-tidy.html). I did some data cleaning and saved it as a clean .csv file you can [access from this URL](https://raw.githubusercontent.com/akprasadan/akprasadan.github.io/refs/heads/master/posts/2025-05-26-pandemic-math-replication/national_math_8th_reading_4th_scores_1990_to_2024_by_percentile.csv) or by downloading it [from my GitHub](https://github.com/akprasadan/akprasadan.github.io/tree/master/posts/2025-05-26-pandemic-math-replication). Since this post isn't about data wrangling but data visualization, I will spare you the details. Proceeding to our first visual, we see that our tibble has columns `Year`, `percentile` (10, 25, 50, 75, or 90), `score`, and `subject`. We will use the `geom_line()` and `geom_point()` geometries, with aesthetics `x = Year`, `y = score`, and `group = percentile`, and lastly we facet by `subject` (i.e., a separate plot per value of `subject`). Well, let's pause for a moment. We *could* facet by `subject`, but I realized later when creating this post that we lose the ability to easily set $y$-limits for each faceted panel separately. Instead, we will create *two* separate plot and put them together using the `cowplot` package. For now, let us just focus on the `subject = math` plot, since the code is the same. While we're loading in packages, I'm going to import the Libre Franklin font from Google Fonts as that resembles the font used by the New York Times. We'll actually change the fonts at the end. ```{r} #| code-fold: show #| fig.cap: "A basic first attempt." library(tidyverse) library(ggtext) # Formats markdown in plot text library(glue) # String interpolation library(showtext) # Changing plot font library(cowplot) # Combining ggplots in a grid font_add_google(name = 'Libre Franklin', family = 'franklin') # Access on Github filename = 'national_math_8th_reading_4th_scores_1990_to_2024_by_percentile.csv' df = read_csv(filename) |> filter(Year >= 2000) |> # focus on math for now filter(subject == 'math') plot = df |> ggplot(aes(x = Year, y = score, group = percentile # one line per percentile )) + geom_point() + geom_line() plot ``` Already, we have made a great start! Let's take care of some low hanging fruit. We don't need the $x$-axis label, and the $y$-axis label information will be present in the title. The title will be formatted with markdown using the `ggtext` package. Let's reduce the unnecessary theme elements by adding `theme_minimal()`, including the grey background. Then we will remove the vertical grid-lines and the minor horizontal ones using the `panel.grid.*` arguments to `theme()`. We will also add back tick marks which were removed by `theme_minimal()`. ```{r} #| code-fold: show #| fig.cap: "Cleaning up the background, grid lines, and ticks." plot = df |> ggplot(aes(x = Year, y = score, group = percentile)) + geom_point() + geom_line() + # Unnecessary theme-ing removed theme_minimal() + labs(x = NULL, y = NULL, title = "**Math** scores for **8th graders**") + theme(# format facet labels with markdown plot.title = element_markdown(), # Now remove a bunch of grid-lines panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank(), # add back x axis ticks and change their length axis.ticks.x.bottom = element_line(), axis.ticks.length.x.bottom = unit(0.2, "cm")) plot ``` Next we need to add some color. The author chose to place the lowest and highest deciles in a different color than the rest. Thus, we will create a binary indicator column reflecting this grouping, prior to our `ggplot`. After that, we can add a color aesthetic according to this grouping. Using the 'Eyedropper' tool in Mozilla Firefox, I determined the author used colors '#b35f57' and '#aaaaaa'. We will use `scale_color_manual()` to implement this new color scale. The legend is unnecessary, so we remove it using `guides()`. ```{r} #| code-fold: show #| fig.cap: "Adding color." plot = df |> mutate(is_extreme_score = ifelse(percentile == '10' | percentile == '90', "yes", "no")) |> ggplot(aes(x = Year, y = score, group = percentile, color = is_extreme_score)) + geom_point() + geom_line() + theme_minimal() + labs(x = NULL, y = NULL, title = "**Math** scores for **8th graders**") + theme(plot.title = element_markdown(), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank(), axis.ticks.x.bottom = element_line(), axis.ticks.length.x.bottom = unit(0.2, "cm")) + scale_color_manual( # Specify breaks explicitly so we get colors in the right order breaks = c('yes', 'no'), values = c('#b35f57', '#aaaaaa')) + guides(color = 'none') # unnecessary legend plot ``` We now add the labels. This will require creating another column `pretty_label` in our dataset with the label. This will require some case-work with the `case_when()` function. Since we only want a label by the 2024 dot, we need to set the label as `NA` for all other years. The top and bottom percentiles have their own unique label, and for the middle percentiles, we can do some string interpolation using the `glue` package. After that, we can call the `geom_richtext()` function, which is an extension of `geom_text()` from the `ggtext` package to allow for Markdown formatting. In particular, we bold the extremes and add a line break with ` `. We also want the labels to the right, which we set using `hjust`. ```{r} #| code-fold: show #| fig.cap: "Labels with HTML/Markdown formatting." plot = df |> mutate(is_extreme_score = ifelse(percentile == '10' | percentile == '90', "yes", "no"), pretty_label = case_when( percentile == 90 & Year == 2024 ~ "**Top scorers**", percentile == 10 & Year == 2024 ~ "**Lowest scorers**", (percentile >= 25 | percentile <= 90) & Year == 2024 ~ glue("{percentile}th percentile"), .default = NA_character_ )) |> ggplot(aes(x = Year, y = score, group = percentile, color = is_extreme_score)) + geom_point() + geom_line() + geom_richtext(aes(label = pretty_label), fill = NA, # text box should be transparent na.rm = TRUE, label.color = NA, # Remove the box outline hjust = 0 # re-position text to right ) + theme_minimal() + labs(x = NULL, y = NULL, title = "**Math** scores for **8th graders**") + theme(plot.title = element_markdown(), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank(), axis.ticks.x.bottom = element_line(), axis.ticks.length.x.bottom = unit(0.2, "cm")) + scale_color_manual(breaks = c('yes', 'no'), values = c('#b35f57', '#aaaaaa')) + guides(color = 'none') plot ``` We need to fix the limits, margins, and breaks of the axes. For example, we want to prevent plot items like the text from being clipped by the panel margins, and instead be clipped by the plot boundaries itself.To do so, set `clip = F` with the `coord_cartesian()` function. While we're at it, we can set `expand = F` to avoid the unnecessary expansion `ggplot` adds by default to the scale. Instead, we manually set the $x$ and $y$ limits with the `scale_*_continuous()` functions. It was at this point I realized why `facet_wrap` would fail: I needed to manually set the axes limits for each faceted plot separately, since math and reading are on different scales. The solution is to just make two separate `ggplot` objects and combine with the `cowplot` package later. In addition to changing the limits, we can control the exact axis ticks that appear and their labels using the `breaks` and `label` argument of the `scale_x_continuous()` function. This requires some tedious relabeling of the axes. Even after this, we still need more space though. So we will directly set the `plot.margin` argument to give more space to the right. ```{r} #| code-fold: show #| fig.cap: "Fixing axis limits, clipping, and axis labels." plot = df |> mutate(is_extreme_score = ifelse(percentile == '10' | percentile == '90', "yes", "no"), pretty_label = case_when( percentile == 90 & Year == 2024 ~ "**Top scorers**", percentile == 10 & Year == 2024 ~ "**Lowest scorers**", (percentile >= 25 | percentile <= 90) & Year == 2024 ~ glue("{percentile}th percentile"), .default = NA_character_ )) |> ggplot(aes(x = Year, y = score, group = percentile, color = is_extreme_score)) + geom_point() + geom_line() + geom_richtext(aes(label = pretty_label), fill = NA, na.rm = TRUE, label.color = NA, hjust = 0) + theme_minimal() + labs(x = NULL, y = NULL, title = "**Math** scores for **8th graders**") + theme(plot.title = element_markdown(), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank(), axis.ticks.x.bottom = element_line(), axis.ticks.length.x.bottom = unit(0.2, "cm"), plot.margin = margin(0.5,2,0.5,0.5, "cm") # Add space on right of plot ) + scale_color_manual(breaks = c('yes', 'no'), values = c('#b35f57', '#aaaaaa')) + guides(color = 'none') + coord_cartesian(expand = F, clip = 'off') + scale_y_continuous(limits = c(210, 340), # Where to start and end the y axis breaks = seq(220, 320, by = 20) # Where to put tick marks # Note, no labels argument necessary because we literally # want to show the integer as the label ) + scale_x_continuous( # where to start and end the x axis limits = c(2000, 2024), # where the tick marks belong breaks = c(2000, 2003, 2007, 2011, 2015, 2019, 2024), # what to label the tick marks that we picked using breaks labels = c("'00", "'03", "'07", "'11", "'15","'19", "'24") ) plot ``` # Handling two plots at a time This looks close to completion for the math version. Let's now write a function `plot_generator()` that makes the math *or* reading version of the plot. The function will take in the original tibble with both subjects included as well as the desired subject and output the specified plot. The function first does the necessary filtering and creation of new columns. The filtering part is a bit complicated, since we want to insert the `subject` argument into a `dplyr` function. This requires so called 'tidy evaluation' using the `!!` injection operator function from the `rlang` package. After subsetting and mutating the tibble, the function returns the plot. The only components of the plot that are unique to each subject are the arguments to `scale_y_continuous()` and the title, so we'll define those conditional on the subject. After that, we can generate the combined plot by calling the function `plot_generator()` twice and then using the simple `cowplot` library syntax. We can add in the caption using the `plot_annotation()` function. ```{r} #| code-fold: show plot_generator = function(df, subject) { if (subject == 'math') { plot_title = "**Math** scores for **8th graders**" y_limits = c(210, 340) y_breaks = seq(220, 320, by = 20) } else if (subject == 'reading') { plot_title = "**Reading** scores for **4th graders**" y_limits = c(150, 270) y_breaks = seq(160, 260, by = 20) } else( stop("Pass in either 'math' or 'reading' as an argument for subject.") ) df_subset = df |> filter(Year >= 2000) |> # Tricky note: Use rlang syntax for tidy evaluation filter(subject == !!subject) |> mutate( is_extreme_score = ifelse(percentile == '10' | percentile == '90', "yes", "no"), pretty_label = case_when( percentile == 90 & Year == 2024 ~ "**Top scorers**", percentile == 10 & Year == 2024 ~ "**Lowest scorers**", (percentile >= 25 | percentile <= 90) & Year == 2024 ~ glue("{percentile}th percentile"), .default = NA_character_ ) ) plot = df_subset |> ggplot(aes( x = Year, y = score, group = percentile, color = is_extreme_score )) + geom_point() + geom_line() + geom_richtext( aes(label = pretty_label), fill = NA, na.rm = TRUE, label.color = NA, hjust = 0 ) + theme_minimal() + labs(x = NULL, y = NULL, title = plot_title) + theme( plot.title = element_markdown(), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank(), axis.ticks.x.bottom = element_line(), axis.ticks.length.x.bottom = unit(0.2, "cm"), plot.margin = margin(0.5, 2, 0.5, 0.5, "cm") ) + scale_color_manual(breaks = c('yes', 'no'), values = c('#b35f57', '#aaaaaa')) + guides(color = 'none') + coord_cartesian(expand = F, clip = 'off') + scale_y_continuous(limits = y_limits, breaks = y_breaks) + scale_x_continuous( limits = c(2000, 2024), breaks = c(2000, 2003, 2007, 2011, 2015, 2019, 2024), labels = c("'00", "'03", "'07", "'11", "'15", "'19", "'24") ) return(plot) } ``` Now let's show a workflow to generate the plot. The placement of the individual plots in the `cowplot` can be modified using the `draw_plot()` arguments, which use relative scaling values (e.g., 0.15 means 15\%). We can also add the caption with `draw_label()`. The caption also has a grey dot that I couldn't replicate since the `draw_label()` function doesn't support HTML customization (it is basically a wrapper for `geom_label()`, not something from the `ggtext` package). For similar reasons, I could not add the hyperlinks. ```{r} #| code-fold: show #| fig.cap: "Our first combo plot." # Access on GitHub (see beginning of post for link) filename = 'national_math_8th_reading_4th_scores_1990_to_2024_by_percentile.csv' df = read_csv(filename) plot_math = plot_generator(df, 'math') plot_reading = plot_generator(df, 'reading') caption_text <- paste( "Top scorers shown are at the 90th percentile; lowest scorers are at the 10th.", "Scores are from the National Assessment of Educational Progress,", "which tests a national sample of students to track educational achievement.", "Source: NAEP. By Francesca Paris. Recreated by Akshay Prasadan.", sep = "\n" ) # Combine with cowplot plot <- ggdraw() + draw_plot(plot_math, x = 0, y = 0.10, width = 0.5, height = 0.85) + draw_plot(plot_reading, x = 0.5, y = 0.10, width = 0.5, height = 0.85) + draw_label(caption_text, x = 0.03, y = 0.016, hjust = 0, vjust = 0, size = 9, lineheight = 1.2, fontfamily = 'franklin', fontface = "plain", color = 'grey40') plot ``` Wow! That's nearly perfect! It's time for the final batch of editing. I find it easiest to first fix a size, and then save your plot using `ggsave` with those precise dimensions. Then, I fine-tune the margins or font sizes until it looks appropriate for that fixed dimension. If you rely on RStudio's plotting window, then the sizes will vary depending on your zoom level or the size of the window on your monitor. This is not reproducible. Recall I imported the Libre Franklin font, which is an approximation of the NYT's font for graphics. Now I'm going to actually apply that font. After that I will make several minor sizing tweaks. This part isn't very interesting, but I'll comment my main changes. ```{r} #| code-fold: show showtext_opts(dpi = 300) showtext_auto() plot_generator_final = function(df, subject) { if (subject == 'math') { plot_title = "**Math** scores for **8th graders**" y_limits = c(210, 340) y_breaks = seq(220, 320, by = 20) } else if (subject == 'reading') { plot_title = "**Reading** scores for **4th graders**" y_limits = c(150, 270) y_breaks = seq(160, 260, by = 20) } else( stop("Pass in either 'math' or 'reading' as an argument for subject.") ) df_subset = df |> filter(Year >= 2000) |> filter(subject == !!subject) |> mutate( is_extreme_score = ifelse(percentile == '10' | percentile == '90', "yes", "no"), pretty_label = case_when( percentile == 90 & Year == 2024 ~ "**Top scorers**", percentile == 10 & Year == 2024 ~ "**Lowest scorers**", (percentile >= 25 | percentile <= 90) & Year == 2024 ~ glue("{percentile}th percentile"), .default = NA_character_ ) ) plot = df_subset |> ggplot(aes( x = Year, y = score, group = percentile, color = is_extreme_score )) + geom_point() + geom_line() + geom_richtext( aes(label = pretty_label), fill = NA, na.rm = TRUE, label.color = NA, hjust = 0, size = 4, # Make label font size close to title font (units are weird) lineheight = 0.75 # Reduce line spacing of labels ) + theme_minimal() + labs(x = NULL, y = NULL, title = plot_title) + theme( text = element_text(family = 'franklin'), plot.title = element_textbox_simple(size = 12, width= NULL, # Prevent line break in titles padding = margin(0,0,10,0) ), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank(), axis.ticks.x.bottom = element_line(), axis.ticks.length.x.bottom = unit(0.2, "cm"), plot.margin = margin(0, 2, 0.5, 0.5, "cm"), # Refine axis tick label sizes, i.e., the "'00", "'03", etc. axis.text.x = element_text(size = 8, family = 'franklin'), axis.text.y = element_text(size = 8, hjust = 1, margin = margin(0, 6, 0, 0)), # Make sure plot title starts at plot edge (left-most boundary), # not panel edge (panel = subset of plot) plot.title.position = 'plot' ) + scale_color_manual(breaks = c('yes', 'no'), values = c('#b35f57', '#aaaaaa')) + guides(color = 'none') + coord_cartesian(expand = F, clip = 'off') + scale_y_continuous(limits = y_limits, breaks = y_breaks) + scale_x_continuous( limits = c(2000, 2024), breaks = c(2000, 2003, 2007, 2011, 2015, 2019, 2024), labels = c("'00", "'03", "'07", "'11", "'15", "'19", "'24") ) return(plot) } plot_math_final = plot_generator_final(df, 'math') plot_reading_final = plot_generator_final(df, 'reading') plot_final <- ggdraw() + draw_plot(plot_math_final, x = 0, y = 0.12, width = 0.5, height = 0.85) + draw_plot(plot_reading_final, x = 0.5, y = 0.12, width = 0.5, height = 0.85) + draw_label(caption_text, x = 0.03, y = 0.015, hjust = 0, vjust = 0, size = 9, lineheight = 1.2, fontfamily = 'franklin', fontface = 'plain', color = 'grey40') # Save ggsave("replicated_nyt.png", plot_final, width = 6.6, height = 6.1, dpi = 300, bg = 'white') ``` ![Final version of replicated graphic.](replicated_nyt.png){width=1980px} Looks pretty good, right? Let me repost below the original. ![Original NYT Graphic.](nyt_graphic.png){width=1980px} # Final Remarks A careful inspection reveals many additional deficiencies in my replication attempt. For example, it seems the author actually used `alpha` as an aesthetic for the column we called `is_extreme_score`, since the percentiles seem to change transparency for the 25th, 50th, and 75th levels. The fonts seem a bit thicker too for the extremes. It would probably be easiest to just manually annotate the plot for finer font control. At this point, I feel like I'm giving up reproducibility, D.R.Y. code, and other programming tenets, but haven't I turned from a programmer into a graphic designer at this point anyway? Moreover, I still do not fully understand the best way to fine-tune margins of the various plot objects, since practically every `theme` argument has its own margin argument. So the spacing certainly doesn't match the original. At a certain point, I basically gave up out of frustration. But this is an exercise in learning `ggplot`, not patience. My central goal was to prove that I, and more importantly, you, the reader, can build high quality graphics that belong in leading newspaper publications. Granted, all the credit goes to the New York Times and Francesca Paris for their original design. I look forward to some future exercises like this, perhaps with The Economist, Bloomberg, or Financial Times instead. # The complete code from start to finish ```{r} #| code-fold: show #| eval: false library(tidyverse) library(ggtext) library(glue) library(showtext) library(cowplot) font_add_google(name = 'Libre Franklin', family = 'franklin') showtext_opts(dpi = 300) showtext_auto() plot_generator_final = function(df, subject) { if (subject == 'math') { plot_title = "**Math** scores for **8th graders**" y_limits = c(210, 340) y_breaks = seq(220, 320, by = 20) } else if (subject == 'reading') { plot_title = "**Reading** scores for **4th graders**" y_limits = c(150, 270) y_breaks = seq(160, 260, by = 20) } else( stop("Pass in either 'math' or 'reading' as an argument for subject.") ) df_subset = df |> filter(Year >= 2000) |> filter(subject == !!subject) |> mutate( is_extreme_score = ifelse(percentile == '10' | percentile == '90', "yes", "no"), pretty_label = case_when( percentile == 90 & Year == 2024 ~ "**Top scorers**", percentile == 10 & Year == 2024 ~ "**Lowest scorers**", (percentile >= 25 | percentile <= 90) & Year == 2024 ~ glue("{percentile}th percentile"), .default = NA_character_ ) ) plot = df_subset |> ggplot(aes( x = Year, y = score, group = percentile, color = is_extreme_score )) + geom_point() + geom_line() + geom_richtext( aes(label = pretty_label), fill = NA, na.rm = TRUE, label.color = NA, hjust = 0, size = 4, lineheight = 0.75 ) + theme_minimal() + labs(x = NULL, y = NULL, title = plot_title) + theme( text = element_text(family = 'franklin'), plot.title = element_textbox_simple(size = 12, width = NULL, padding = margin(0,0,10,0) ), panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank(), panel.grid.minor.y = element_blank(), axis.ticks.x.bottom = element_line(), axis.ticks.length.x.bottom = unit(0.2, "cm"), plot.margin = margin(0, 2, 0.5, 0.5, "cm"), axis.text.x = element_text(size = 8, family = 'franklin'), axis.text.y = element_text(size = 8, hjust = 1, margin = margin(0, 6, 0, 0)), plot.title.position = 'plot' ) + scale_color_manual(breaks = c('yes', 'no'), values = c('#b35f57', '#aaaaaa')) + guides(color = 'none') + coord_cartesian(expand = F, clip = 'off') + scale_y_continuous(limits = y_limits, breaks = y_breaks) + scale_x_continuous( limits = c(2000, 2024), breaks = c(2000, 2003, 2007, 2011, 2015, 2019, 2024), labels = c("'00", "'03", "'07", "'11", "'15", "'19", "'24") ) return(plot) } filename = 'national_math_8th_reading_4th_scores_1990_to_2024_by_percentile.csv' df = read_csv(filename) caption_text <- paste( "Top scorers shown are at the 90th percentile; lowest scorers are at the 10th.", "Scores are from the National Assessment of Educational Progress,", "which tests a national sample of students to track educational achievement.", "Source: NAEP. By Francesca Paris. Recreated by Akshay Prasadan.", sep = "\n" ) plot_math_final = plot_generator_final(df, 'math') plot_reading_final = plot_generator_final(df, 'reading') plot <- ggdraw() + draw_plot(plot_math, x = 0, y = 0.10, width = 0.5, height = 0.85) + draw_plot(plot_reading, x = 0.5, y = 0.10, width = 0.5, height = 0.85) + draw_label(caption_text, x = 0.03, y = 0.016, hjust = 0, vjust = 0, size = 9, lineheight = 1.2, fontfamily = 'franklin', fontface = "plain", color = 'grey40') ggsave("replicated_nyt.png", plot_final, width = 6.6, height = 6.1, dpi = 300, bg = 'white') ```