A Grammar of Graphics
A unified framework for constructing statistical graphics.
In the last set of notes, we saw that there is a side to the discipline of Statistics that looks like engineering. Summary statistics - medians, standard deviations, etc. - are carefully crafted tools that capture different characteristics of a data set for use in very specific situations. There is another practice in statistics that looks more like a science; that is, a field that seeks to take many different phenomena and explain them using a systematic overarching theory. That practice is data visualization.
At this point in the course, you’ve seen several several examples of data visualizations.
The diversity of shapes and structures used in these plots suggest that each one is a thing-unto-itself, specially devised to provide one particular style of visualization. But what elements do they share?
Focus on the nature of the data being used. Exactly half of the plots above are illustrating the distribution of a single variable; the other half illustrate the relation between two variables. Can you tell which is which?
Consider the manner in which variability in the data is being conveyed used different visual cues. How many of the plots above utilize an x-axis? A y-axis? Color?
Finally, how are the observations finding their way onto the plot? Three of the plots above share the same data variable, utilize the same visual cues, and differ only in the shape used to encode the observations.
By asking these questions, we begin to find recurring structures in a wide range of plot types. These recurring structures have been compiled into a widely-used framework called the Grammar of Graphics.
The Grammar of Graphics
In 1999, a statistician named Leland Wilkinson published the first edition of what has been the most influential work in data visualization, The Grammar of Graphics1. The title is fitting. In the same way that a grammar defines the regular structures and composition of a language, his book outlines a framework to structure statistical graphics.
Nearly every current software tool used to build plots has been informed by this book2. Its influence can be found in Tableau, Plotly, and the Python libraries bokeh, altair, seaborn, and plotnine. The most complete implementation of the grammar is found in an R package called ggplot2 by Hadley Wickham3.
In Wickham’s adaptation of the grammar of graphics, a plot can be decomposed into three primary elements:
- the data,
- the aesthetic mapping of the variables in the data to visual cues, and
- the geometry used to encode the observations on the plot.
Let’s go through each of these components one-by-one to understand the role that they play in a plot like this, which we’ll refer to as the “penguin plot”.
Data
What variables are needed to construct the penguin plot above?
We see bill_length_mm
and bill_depth_mm
; those are labeled clearly on the x and y axes. We must also know the species
of each of these penguins in order to know which color to label each point. In other words, there are three columns of a data frame that we need to have on hand.
Code
select(penguins,
bill_length_mm,
bill_depth_mm,
species)
# A tibble: 333 × 3
bill_length_mm bill_depth_mm species
<dbl> <dbl> <fct>
1 39.1 18.7 Adelie
2 39.5 17.4 Adelie
3 40.3 18 Adelie
4 36.7 19.3 Adelie
5 39.3 20.6 Adelie
6 38.9 17.8 Adelie
7 39.2 19.6 Adelie
8 41.1 17.6 Adelie
9 38.6 21.2 Adelie
10 34.6 21.1 Adelie
# ℹ 323 more rows
In this course we’ve talked plenty about the structure of a data frame, so this part of the grammar of graphics is straight-forward. Be sure that every variable that you wish to include in your plot is present in the same data frame.
More fundamentally, be sure the data you’re using is well-suited to the message you aim to convey with your plot. Many plots go wrong right here at the outset, so be sure you’re on firm footing.
Aesthetics
The most impactful decision that you’ll make when constructing a plot using the grammar of graphics is deciding how to encode variables in a data frame into visual variation in your plot.
The penguin plot relies upon three forms of visual variation. The first is the location along the x-axis. Penguins with longer bills are placed on the right side of the plot and those with shorter bills are placed on the left. Variation in bill depth is captured by variation in the location along the y-axis, which is the second form. The third form is color: each of the three species is designated by one of three colors.
We can summarize this encoding, or “aesthetic mapping”, as:
-
bill_length_mm
is mapped to the x-axis -
bill_depth_mm
is mapped to the y-axis -
species
is mapped to color
These are three of many different techniques for visually encoding variability. Here is a list of the aesthetic attributes that are most commonly used:
-
x
: location along the x-axis -
y
: location along the y-axis -
color
: hue of the mark that represents the observation -
alpha
: the level of transparency of the color -
size
: the size of the mark representing the observation -
shape
: the shape of the mark representing the observation -
fill
: the color of the inside of the representation of an observation
Geometries
With the data set in place and the aesthetic mappings selected, the final choice in making our plot is to decide how to graphically express the observations themselves. For the penguin plot above, each observation in represented by a point, so it is said to use a “point” geometry. That is just one of many options. Other options are listed below.
point
bar
line
histogram
density
violin
dotplot
boxplot
When we speak about whether a plot is a scatter plot, a bar chart, a histogram, etc, we are discussing the geometry of a plot. The impact of this choice can be seen in the following two plots.
Code
library(patchwork)
p1 <- ggplot(penguins, aes(x = bill_length_mm,
y = species)) +
geom_violin()
p2 <- ggplot(penguins, aes(x = bill_length_mm,
y = species)) +
geom_boxplot()
p1 + p2
Both plots share the same data (penguins
) and the same aesthetic mappings(bill_length
to the x-axis and species
to the y-axis). Where they differ is the geometry: the plot on the left uses the violin while the one on the right uses the boxplot.
Example: Arbuthnot
In the Arbuthnot Lab you created your very first plot, which looked like this.
Code
We can now decompose this graphic using the grammar of graphics.
- data: the arbuthnot data frame
- aesthetic mappings: year is mapped to the x-axis, total is mapped to the y-axis
- geometry: the observations are expressed as a line
This process makes clear what decisions were made in constructing the plot and suggests ways in which we might consider changing the graphic. What if we changed the geometry so that the observations are expressed as points?
Code
ggplot(arbuthnot, aes(x = year, y = total)) +
geom_point()
Is this a better graphic? That depends on the message you aim to convey. The line geometry emphasizes the general trend over time. The point geometry makes clearer the total births in each particular year.
What if you wanted to convey both of those messages? If two geometries are compatible, they can be combined.
Code
ggplot(arbuthnot, aes(x = year, y = total)) +
geom_point() +
geom_line()
Is this a better graphic? Again, that depends. It makes it possible to see both the trend over time and the individual observations. But this plot is more complicated and therefore articulates each of those individual messages less clearly. Sometimes simplicity and a single message is best.
Here’s another example of a plot that combines two geometries:
Here, a density geometry is overlaid on top of a histogram geometry. It is unclear whether this was a good design decision. What do you think? What is the advantage of a stair-step histogram versus the smooth line of a density curve?
ggplot2
The Grammar of Graphics is an abstract framework of data visualization and so far we’ve kept things abstract. But to construct your own plots you’ll need to be able to implement them in software. We are fortunate to have a translation of the grammar into readible, flexible, and powerful code: the ggplot2 package in R.
The basic template for describing a plot in code using ggplot2 is:
ggplot(DATAFRAME, aes(ATTR = VAR)) +
geom_GEOMETRY()
Where DATAFRAME
is the name of your data frame, ATTR
is the aesthetic attribute that you’ll be using, VAR
is a variable in your data frame, and GEOMETRY
is the name of a geometry.
Let’s code up the penguin plot bit by bit.
ggplot(penguins)
The ggplot()
function takes as its first argument a data frame. By itself, there’s not much to look at; it just creates our blank canvas.
Let’s now indicate how we want to map our aesthetic attributes to variables in that data frame by including them inside the aes()
function as the second argument to ggplot()
.
Now we’re getting somewhere. Our axes are now set up with labels and values that are determined by looking into the data frame and checking the range of each variable. Note that we can pass more than one aesthetic mapping inside aes()
, we just need to separate them with a comma.
All that is left is to actually put the observations on the plot by declaring what geometry we want. Let’s express each penguin as a point by using +
to add on a new layer called geom_point()
.
ggplot(penguins, aes(x = bill_length_mm,
y = bill_depth_mm)) +
geom_point()
Here is something that you may recognize as a scatter plot. This is a good choice of plot for visualizing the relationship between two numerical variables. The original penguin plot, though, visualizes the relationship between three variables. To complete our plot, let’s add in that third variable by mapping species to the color aesthetic attribute.
ggplot(penguins, aes(x = bill_length_mm,
y = bill_depth_mm,
color = species)) +
geom_point()
Beautiful!
Over the coming weeks, you’ll get lots of practice with ggplot2. It is an incredibly flexible and powerful piece of software that helps you not just build plots, but think about each of the design decisions that you make. Throughout your journey it is helpful to have a source of documentation to learn about the functionality of the tool. The help files on each function are only so useful. A better option is the official ggplot2
documentation: https://ggplot2.tidyverse.org/.
Now that you have a sense of how to build a ggplot, scroll back up to the plots in these notes and click the small arrow above each plot on the left side. This will reveal the code that was used to generate the plot. Copy and paste the code into RStudio to recreate them and try experimenting around with different aesthetic mappings and geometries (note that you’ll need to load in library(tidyverse)
and any data sets that you use).
Summary
The Grammar of Graphics is a framework to express a great variety of statistical graphics in terms of their shared elements. In this framework, the core features of the plot are the data, the aesthetic mapping between aesthetic attributions and variables in the data frame, and the geometry that is used to express the observation. There are a wide range of geometries and aesthetic attributes that can be drawn from and recombined in powerful ways. These notes cover just the fundamentals of the framework, so if you are unsatisfied with the resulting plots, that’s good. In the coming notes, we will polish up these plots to make thoughtful graphics that focus on effectively conveying a single message.
Materials from class
Slides
Footnotes
Wilkinson, Leland. The grammar of graphics. Springer Science & Business Media, 2005.↩︎
For more context around The Grammar of Graphics and the development of modern visualization tools, read the brief Three Waves of Data Visualization by Elijah Meeks, Senior Data Visualization Engineer at Netflix: https://www.tableau.com/about/blog/2019/2/three-waves-data-visualization-brief-history-and-predictions-future-100830.↩︎
The
ggplot2
package is described in the manuscript, A layered grammar of graphics, by Hadley Wickham in the Journal of Computational and Graphical Statistics in 2010.↩︎