Make your title legendary

The bells and whistles of a dumbbell plot on trends in university enrollment.
R
dplyr
ggplot
Author
Affiliation

PhD in Statistics & Data Science from Carnegie Mellon University

Published

May 31, 2025

Introduction

I was recently browsing the excellent R Graph Gallery website and came across a gorgeous dumbbell plot made by Tobias Stalder for TidyTuesday.1 A dumbbell plot is a two-sided extension of a lollipop plot, which is in turn a substitute for the familiar but bland bar chart. The bar is replaced with a single line, and a point is added to the end. Tobias Stalder’s submission showed the large gap in university enrollment between men and women at Historically Black Colleges and Universities (HBCU). I especially liked his choice of using color in the subtitle rather than having a legend taking up unnecessary space.

Tobias Stalder’s TidyTuesday submission

In this post, I decided to find broader country-wide trends in university enrollments stratified by sex and make a similar plot. I visited the US government-ran Digest of Education Statistics website ran by The Institute of Education Sciences. In Chapter 3, under section 303 (Total Fall Enrollment — General), I accessed Table 303.10 to see fall enrollment in the US from 1947 to 2023. The data also includes the breakdown of public versus private as well as full-time versus part-time. For now, I decided to just plot the entire time series, so I won’t include the summary statistics in my plot that Tobias did (e.g., the mean is itself changing over time, so I’m not sure a global mean is very useful).

Before I dive in, let me summarize some of the important tidyverse skills this post covers:

  • Read in Excel files with the readxl package, and ignoring extraneous rows/columns
  • Cleaning column names with the janitor package and its clean_names() function
  • Making dumbbells and lollipops with geom_segment()
  • Feeding in multiple tibbles in your various ggplot geometries (and the inherit.aes argument)
  • Adding transparent background images with geom_image() from the ggimage package
  • Adding color and formatting to text in the titles/captions using the ggtext package
  • Changing the font with the showtext package

Data Cleaning

I downloaded an .xlsx version of Table 303.10. Unfortunately, it comes in a fairly messy format as is typical, so we’ll need to clean it.

Government datasets are rarely ready off the shelf.

We’ll need the readxl package. We can discard the first three rows with the skip argument of read_xlsx(). After that, we’ll use one of my favorite functions which is clean_names() from the janitor package, which converts column names to snake_case. This has practical benefits beyond aesthetics: RStudio’s autocompletion will fail if there are spaces in the column names. This will save you some typing!

There are some footers in the Excel file which we can remove just by filtering for non-NA rows in the male column. We’ll also discard row 5 of the original file, which just contains column numbers, by filtering for year != "1".

Code
library(tidyverse)
library(readxl)
library(janitor)

table_filename = 'tabn303.10.xlsx'

df =
  # First 3 rows are unnecessary
  readxl::read_xlsx(table_filename, skip = 3) |>
  # convert title name to snake case
  janitor::clean_names() |>
  select(year, total = total_enrollment, male, female) |>
  # Removes the footers
  filter(!is.na(male)) |>
  # remove unnecessary column number header row
  filter(year != "1")

The next issue is the unnecessary "\1\" or "\r\n" strings that appear in the year column. We’ll use str_replace() from the stringr package to remove that. Then year needs to converted to numeric once those non-numeric characters are removed. While we are using mutate(), I’ll create proportion male and proportional female columns. Then I’ll select year and the two proportion columns, since we no longer need the full counts. I use the column selection helper starts_with() from the tidyselect package, although with just 2 columns, this is not very useful. I’m highlighting its use for your more complex applications.

Code
df = df |>
  mutate(
    year = str_replace(year, '\\\\1\\\\', ""),
    year = str_replace(year, "\\r\\n", ""),
    year = as.numeric(year),
    prop_male = male / total,
    prop_female = female / total
  ) |>
  select(year, 
         # selection helpers from the tidyselect package
         starts_with('prop')) 

I plan to only plot every other year for a simpler graphic. There change is relatively gradual so showing year to year changes doesn’t really matter much. My data goes from 1947 to 2023, so using the odd years is sufficient.

Also, we will later see that the trend is declining for males and rising for females, with a crossing point in 1979. I’d like to indicate this with a straight line at that point. I don’t want to hard-code it, so I’ll save this information as a one row tibble at this year called final_male_df. I’ll do this calculation before I filter for odd years only, importantly.

Code
final_male_df = df |>
  filter(prop_male > prop_female) |>
  # Get maximal year in which male > female enrollment
  slice_max(order_by = year)

# Odd years only
df = df |> filter(year %in% seq(1947, 2023, by = 2))

The last piece of data wrangling we need is reshaping df into a long format. We want a single \(y\)-axis column, say enrollment_pct, so we’ll pivot the prop_male and prop_female into a sex and enrollment_pct pair of columns. The "prop_" will then become unnecessary, so we’ll remove it with names_prefix = "prop_" .

Code
long_df = df |>
  pivot_longer(
    cols = c(prop_male, prop_female),
    names_to = "sex",
    # Remove 'prop_' from new row names in sex
    names_prefix = "prop_",
    values_to = "enrollment_pct"
  )

Plotting

Our basic Geometries

We will start by adding points with year on the \(x\) axis and enrollment_pct on the \(y\) axis colored by sex. Then we add a segment from the male to female values using geom_segment(). To do so, we actually feed in the wide dataset df because it takes in a y and yend aesthetic for which we will specify prop_male and prop_female, respectively. The long version of the dataset no longer has these columns. However, the df dataframe does not have a sex column, so we don’t want to feed in the previous aesthetics. Thus, we must include an inherit.aes = F argument to geom_segment().

Code
long_df |>
  ggplot(aes(x = year, y = enrollment_pct, color = sex)) +
  geom_point() +
  geom_segment(
    # Use df to access prop_male, prop_female
    data = df,
    aes(x = year, y = prop_male, yend = prop_female),
    # df doesn't have a sex column
    inherit.aes = F
  ) 

The next step is to add the vertical line for the transition point where more females than males enroll. We will feed in a new dataset again, the final_male_df row we saved. I want the line to occur on the first year in which female enrollment is higher, so I add one to the year. Once again, we’ll need to not inherit the aesthetics. The reason I used geom_segment() instead of geom_vline() is that I wanted finer control over where the line starts and ends.

Code
long_df |>
  ggplot(aes(x = year, y = enrollment_pct, color = sex)) +
  geom_point() +
  geom_segment(
    data = df,
    aes(x = year, y = prop_male, yend = prop_female),
    inherit.aes = F
  ) +
  geom_segment(
    # this is why we saved a 1 row tibble, not just the year
    data = final_male_df,
    aes(x = year + 1),
    y = 0.25,
    yend = 0.7,
    inherit.aes = F
  ) 

The last geometry I wish to add is geom_image(). The plot is fairly bare, so I wanted a transparent graphic representing education sitting in the bottom right. We’ll have to load in the ggimage package. To add transparency, we copy the technique here and define a transparency function to pass into geom_image().

Code
library(ggimage)

transparent <- function(img) {
  magick::image_fx(img, expression = "0.5*a", channel = "alpha")
}

long_df |>
  ggplot(aes(x = year, y = enrollment_pct, color = sex)) +
  geom_point() +
  geom_segment(
    data = df,
    aes(x = year, y = prop_male, yend = prop_female),
    inherit.aes = F
  ) +
  geom_segment(
    data = final_male_df,
    aes(x = year + 1),
    y = 0.25,
    yend = 0.7,
    inherit.aes = F
  ) +
  ggimage::geom_image(
    # create a temp. tibble to pass in aesthetics
    data = tibble(year = 1998, enrollment_pct = 0.32),
    aes(x = year, y = enrollment_pct, 
        image = 'generic_students.png'),
    image_fun = transparent,
    size = 0.3,
    inherit.aes = F
  ) 

I’ll add in some styling to these geoms.

Code
long_df |>
  ggplot(aes(x = year, y = enrollment_pct, color = sex)) +
  geom_point(size = 2.1) +
  geom_segment(
    data = df,
    aes(x = year, y = prop_male, yend = prop_female),
    inherit.aes = F,
    alpha = 0.3,
    linewidth = 2.2,
    color = '#d95f02'
  ) +
  geom_segment(
    data = final_male_df,
    aes(x = year + 1),
    y = 0.25,
    yend = 0.7,
    color = '#d95f02',
    linewidth = 1,
    alpha = 0.5,
    linetype = 'dotdash',
    inherit.aes = F
  ) +
  ggimage::geom_image(
    data = tibble(year = 1998, enrollment_pct = 0.32),
    aes(x = year, y = enrollment_pct, image = 'generic_students.png'),
    image_fun = transparent,
    size = 0.3,
    inherit.aes = F
  ) 

Colors and Scales

After consulting Color Brewer 2.0, I settled on using #7570b3 for male and #1b9e77 for female. For the bars, I used a transparent version of #d95f02. Let’s create a manual color scale to use these instead. We will remove the legend by specifying guide = 'none'.

Along the way, we can specify the breaks for the \(x\) and \(y\) axes, and use percent_format() from the scales package to get nice percent signs. The expand = c(0, 0) argument removes the automatic space added to the axes, so we can customize the limits manually.

Code
library(scales)

long_df |>
  ggplot(aes(x = year, y = enrollment_pct, color = sex)) +
  geom_point(size = 2.1) +
  geom_segment(
    data = df,
    aes(x = year, y = prop_male, yend = prop_female),
    inherit.aes = F,
    alpha = 0.3,
    linewidth = 2.2,
    color = '#d95f02'
  ) +
  geom_segment(
    data = final_male_df,
    aes(x = year + 1),
    y = 0.25,
    yend = 0.7,
    color = '#d95f02',
    linewidth = 1,
    alpha = 0.5,
    linetype = 'dotdash',
    inherit.aes = F
  ) +
  ggimage::geom_image(
    data =
      tibble(year = 1998, enrollment_pct = 0.32),
    aes(x = year, y = enrollment_pct, image = 'generic_students.png'),
    image_fun = transparent,
    size = 0.3,
    inherit.aes = F
  ) +
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0.25, 0.73),
    expand = c(0, 0)
  ) +
  scale_x_continuous(
    breaks = seq(1950, 2025, by = 10),
    limits = c(1945, 2024),
    expand = c(0, 0)
  ) +
  scale_color_manual(
    breaks = c("male", "female"),
    values = c("#7570b3", '#1b9e77'),
    guide = 'none'
  ) 

Looks good. Let’s fix up the labels. Importantly, I want to put the legend in the title itself. To do that, we use the ggtext package and add the HTML tags in the title. To render it properly, we must call plot.title = element_markdown() inside the theme() function. The HTML format is messy, but goes like this: <span style = 'color:{your-color};'>{your-text}</span>.

The same ggtext package lets me add line breaks to the caption as well, using <br>. We’ll reposition the caption using the plot.caption and plot.caption.position arguments.

Code
library(ggtext)

long_df |>
  ggplot(aes(x = year, y = enrollment_pct, color = sex)) +
  geom_point(size = 2.1) +
  geom_segment(
    data = df,
    aes(x = year, y = prop_male, yend = prop_female),
    inherit.aes = F,
    alpha = 0.3,
    linewidth = 2.2,
    color = '#d95f02'
  ) +
  geom_segment(
    data = final_male_df,
    aes(x = year + 1),
    y = 0.25,
    yend = 0.7,
    color = '#d95f02',
    linewidth = 1,
    alpha = 0.5,
    linetype = 'dotdash',
    inherit.aes = F
  ) +
  ggimage::geom_image(
    data = tibble(year = 1998, enrollment_pct = 0.32),
    aes(x = year, y = enrollment_pct, image = 'generic_students.png'),
    image_fun = transparent,
    size = 0.3,
    inherit.aes = F
  ) +
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0.25, 0.73),
    expand = c(0, 0)
  ) +
  scale_x_continuous(
    breaks = seq(1950, 2025, by = 10),
    limits = c(1945, 2024),
    expand = c(0, 0)
  ) +
  scale_color_manual(
    breaks = c("male", "female"),
    values = c("#7570b3", '#1b9e77'),
    guide = 'none'
  ) +
  labs(
    x = "",
    y = "",
    title = "<span style = 'color:#7570b3;'>**Male**</span> and <span style = 'color:#1b9e77;'>**Female**</span> University Enrollment has swapped places.",
    subtitle = "Proportion of Fall Enrollment between 1947 and 2023 (odd years only)",
    caption = "Data sourced from *The Institute of Education Sciences*.<br><br>Plot by Akshay Prasadan."
  ) +
  theme(
    # use element_markdown() to tell ggtext to recognize the HTML
    plot.title = ggtext::element_markdown(),
    # Position caption on side of plot, not the graph panel
    plot.caption.position = 'plot',
    plot.caption = ggtext::element_markdown(hjust = 0) # move caption to left
  )

The default theme looks ugly to me. Let’s remove the grey background, customize which panel gridlines we want, and make other minor changes. The easiest way to do this is to add theme_minimal() and then put back what is of interest.

Code
long_df |>
  ggplot(aes(x = year, y = enrollment_pct, color = sex)) +
  geom_point(size = 2.1) +
  geom_segment(
    data = df,
    aes(x = year, y = prop_male, yend = prop_female),
    inherit.aes = F,
    alpha = 0.3,
    linewidth = 2.2,
    color = '#d95f02'
  ) +
  geom_segment(
    data = final_male_df,
    aes(x = year + 1),
    y = 0.25,
    yend = 0.7,
    color = '#d95f02',
    linewidth = 1,
    alpha = 0.5,
    linetype = 'dotdash',
    inherit.aes = F
  ) +
  ggimage::geom_image(
    data = tibble(year = 1998, enrollment_pct = 0.32),
    aes(x = year, y = enrollment_pct, image = 'generic_students.png'),
    image_fun = transparent,
    size = 0.3,
    inherit.aes = F
  ) +
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0.25, 0.73),
    expand = c(0, 0)
  ) +
  scale_x_continuous(
    breaks = seq(1950, 2025, by = 10),
    limits = c(1945, 2024),
    expand = c(0, 0)
  ) +
  scale_color_manual(
    breaks = c("male", "female"),
    values = c("#7570b3", '#1b9e77'),
    guide = 'none'
  ) +
  labs(
    x = "",
    y = "",
    title = "<span style = 'color:#7570b3;'>**Male**</span> and <span style = 'color:#1b9e77;'>**Female**</span> University Enrollment has swapped places.",
    subtitle = "Proportion of Fall Enrollment between 1947 and 2023 (odd years only)",
    caption = "Data sourced from *The Institute of Education Sciences*.<br><br>Plot by Akshay Prasadan."
  ) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    panel.grid.major.y = element_line(color = 'grey90', linetype = 'dotted'),
    axis.ticks.x.bottom = element_line(color = 'black'),
    axis.ticks.length.x.bottom = unit(4, units = 'pt'),
    axis.text.x.bottom = element_text(color = 'black'),
    axis.text.y.left = element_text(color = 'black'),
    plot.title = ggtext::element_markdown(),
    plot.caption.position = 'plot',
    plot.caption = ggtext::element_markdown(hjust = 0)
  )

Fonts

For our final customization, let’s change the font to “Lato” from Google Fonts using the showtext package. We’ll need to set the global size using text = element_text(family = 'Lato', size = 5), since often times the default sizes will be way too big or small. For some weird reason, text that gets modified by element_markdown() sometimes loses their spaces, so to fix that, I added family = "" in the plot.caption= element_markdown() argument.

Code
library(showtext)

# sysfonts is loaded in by the showtext package
sysfonts::font_add_google("Lato")
showtext::showtext_auto()
showtext::showtext_opts(dpi = 300)

long_df |>
  ggplot(aes(x = year, y = enrollment_pct, color = sex)) +
  geom_point(size = 2.1) +
  geom_segment(
    data = df,
    aes(x = year, y = prop_male, yend = prop_female),
    inherit.aes = F,
    alpha = 0.3,
    linewidth = 2.2,
    color = '#d95f02'
  ) +
  geom_segment(
    data = final_male_df,
    aes(x = year + 1),
    y = 0.25,
    yend = 0.7,
    color = '#d95f02',
    linewidth = 1,
    alpha = 0.5,
    linetype = 'dotdash',
    inherit.aes = F
  ) +
  ggimage::geom_image(
    data = tibble(year = 1998, enrollment_pct = 0.32),
    aes(x = year, y = enrollment_pct, image = 'generic_students.png'),
    image_fun = transparent,
    size = 0.3,
    inherit.aes = F
  ) +
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0.25, 0.73),
    expand = c(0, 0)
  ) +
  scale_x_continuous(
    breaks = seq(1950, 2025, by = 10),
    limits = c(1945, 2024),
    expand = c(0, 0)
  ) +
  scale_color_manual(
    breaks = c("male", "female"),
    values = c("#7570b3", '#1b9e77'),
    guide = 'none'
  ) +
  labs(
    x = "",
    y = "",
    title = "<span style = 'color:#7570b3;'>**Male**</span> and <span style = 'color:#1b9e77;'>**Female**</span> University Enrollment has swapped places.",
    subtitle = "Proportion of Fall Enrollment between 1947 and 2023 (odd years only)",
    caption = "Data sourced from *The Institute of Education Sciences*.<br><br>Plot by Akshay Prasadan."
  ) +
  theme_minimal() +
  theme(
    # specify a global font size as a default
    text = element_text(family = 'Lato', size = 7),
    panel.grid = element_blank(),
    panel.grid.major.y =
      element_line(color = 'grey90', linetype = 'dotted'),
    axis.ticks.x.bottom = element_line(color = 'black'),
    axis.ticks.length.x.bottom = unit(4, units = 'pt'),
    axis.text.x.bottom = element_text(color = 'black'),
    axis.text.y.left = element_text(color = 'black'),
    plot.title = ggtext::element_markdown(),
    plot.caption.position = 'plot',
    # add family = "" for some annoying bug with spaces
    plot.caption = ggtext::element_markdown(hjust = 0, family = "")
  )

Our final version of the plot.

Full Code

Here is the entire, self-contained code all in one place. I removed the comments but you can find them in the above code chunks.

Code
library(tidyverse)
library(readxl)
library(janitor)
library(ggimage)
library(scales)
library(ggtext)
library(showtext)

sysfonts::font_add_google("Lato")
showtext::showtext_auto()
showtext::showtext_opts(dpi = 300)

transparent <- function(img) {
  magick::image_fx(img, expression = "0.5*a", channel = "alpha")
}

df = readxl::read_xlsx(table_filename, skip = 3) |>
  janitor::clean_names() |>
  select(year, total = total_enrollment, male, female) |>
  filter(!is.na(male)) |>
  filter(year != "1")
  mutate(
    year = str_replace(year, '\\\\1\\\\', ""),
    year = str_replace(year, "\\r\\n", "")
  ) |>
  mutate(
    year = str_replace(year, '\\\\1\\\\', ""),
    year = str_replace(year, "\\r\\n", ""),
    year = as.numeric(year),
    prop_male = male / total,
    prop_female = female / total
  ) |>
  select(year, 
         starts_with('prop')) 

final_male_df = df |>
  filter(prop_male > prop_female) |>
  slice_max(order_by = year)

df = df |>
  filter(year %in% seq(2023, 1947, by = -2))

long_df = df |>
  pivot_longer(
    cols = c(prop_male, prop_female),
    names_to = "sex",
    names_prefix = "prop_",
    values_to = "enrollment_pct"
  )

plot = long_df |>
  ggplot(aes(x = year, y = enrollment_pct, color = sex)) +
  geom_point(size = 2.1) +
  geom_segment(
    data = df,
    aes(x = year, y = prop_male, yend = prop_female),
    inherit.aes = F,
    alpha = 0.3,
    linewidth = 2.2,
    color = '#d95f02'
  ) +
  geom_segment(
    data = final_male_df,
    aes(x = year + 1),
    y = 0.25,
    yend = 0.7,
    color = '#d95f02',
    linewidth = 1,
    alpha = 0.5,
    linetype = 'dotdash',
    inherit.aes = F
  ) +
  ggimage::geom_image(
    data = tibble(year = 1998, enrollment_pct = 0.32),
    aes(x = year, y = enrollment_pct, image = 'generic_students.png'),
    image_fun = transparent,
    size = 0.3,
    inherit.aes = F
  ) +
  scale_y_continuous(
    labels = scales::percent_format(),
    limits = c(0.25, 0.73),
    expand = c(0, 0)
  ) +
  scale_x_continuous(
    breaks = seq(1950, 2025, by = 10),
    limits = c(1945, 2024),
    expand = c(0, 0)
  ) +
  scale_color_manual(
    breaks = c("male", "female"),
    values = c("#7570b3", '#1b9e77'),
    guide = 'none'
  ) +
  labs(
    x = "",
    y = "",
    title = "<span style = 'color:#7570b3;'>**Male**</span> and <span style = 'color:#1b9e77;'>**Female**</span> University Enrollment has swapped places.",
    subtitle = "Proportion of Fall Enrollment between 1947 and 2023 (odd years only)",
    caption = "Data sourced from *The Institute of Education Sciences*.<br><br>Plot by Akshay Prasadan."
  ) +
  theme_minimal() +
  theme(
    text = element_text(family = 'Lato', size = 9),
    panel.grid = element_blank(),
    panel.grid.major.y = element_line(color = 'grey90', linetype = 'dotted'),
    axis.ticks.x.bottom = element_line(color = 'black'),
    axis.ticks.length.x.bottom = unit(4, units = 'pt'),
    axis.text.x.bottom = element_text(color = 'black'),
    axis.text.y.left = element_text(color = 'black'),
    plot.title = ggtext::element_markdown(),
    plot.caption.position = 'plot',
    plot.caption = ggtext::element_markdown(hjust = 0, family = "")
  )

ggsave(
  filename = "final_education_plot.png",
  dpi = 300,
  height = 5,
  width = 8,
  bg = "white"
)

Takeaways

The plot shows that the trend Tobias showed for HBCUs extends nationwide. Since the end of WWII, males went from nearly 71% of all college enrollments to only 42%. This has massive implications for electoral trends given the impact of education on political leanings. But I’ll leave that for Fivethirtyeight or its remnants to discuss.

The main takeaway of this blog post: Use text customization feature to replace existing plot elements like legends that can take up valuable space or contribute to a sense of clutter. Obviously if there are many categories, this is not advisable. But for something as simple as male versus female trends, Tobias had the right idea.

I hope you also enjoyed the other tips and tricks with fonts, background images, dumbbell plot construction, and data cleaning Excel formats.

Footnotes

  1. TidyTuesday is a weekly community ‘event’ (and podcast) in which data enthusiasts around the world share their data cleaning, visualizations and modelling techniques for that week’s data release.↩︎

Citation

For attribution, please cite this work as:
Prasadan, Akshay. 2025. “Make Your Title Legendary.” May 31, 2025. https://akprasadan.github.io/posts/2025-05-31-university-enrollment/.