library(ggplot2)

ggplot2 is one of the best-known visualization tools for R, and in data science more broadly. Its underlying philosophy, the “grammar of graphics”, makes it very versatile and fairly modular.

In this introduction, I will explain the basic syntax of the package and demonstrate how to generate a few common graphs. We will be using the built-in ToothGrowth dataset.

data(ToothGrowth)
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

This dataset comes from the study “The growth of the odontoblast of the incisor teeth as a criterion of vitamin C intake of the guinea pig” (Crampton et al., 1947), which compared the length of odontoblasts, the cells responsible for tooth growth, in guinea pigs given varying doses of vitamin C as either orange juice (OJ) or ascorbic acid (VC).
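Before choosing a mapping, it can help to check the type of each column. A quick look with base R, using nothing beyond the built-in dataset:

```r
# Inspect column types: len and dose are numeric, supp is a factor
str(ToothGrowth)

# Count observations for each combination of supplement and dose
table(ToothGrowth$supp, ToothGrowth$dose)
```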

Syntax

ggplot2 requires three elements to generate a plot: data, an aesthetic mapping, and a geometry.

Data

Data is provided to the ggplot function as the first parameter.

p = ggplot(ToothGrowth)
p

Note that the output here isn’t very informative. That’s because ggplot requires a second parameter, the mapping, which must be wrapped inside of the aes function. aes takes a wide set of parameters, including

  • x

  • y

  • alpha

  • color

  • fill

  • group

  • size

  • weight

  • label

What do these parameters do? That depends entirely on the choice of geometry; more on that very soon.

First, let’s define a mapping for the ToothGrowth data. Because dose and len are both continuous and numeric, we’ll set them to x and y, respectively. supp is a factor representing experimental condition, so let’s try color.

tooth_aes = aes(x = dose, y = len, color = supp)

Let’s try ggplot again.

p = ggplot(ToothGrowth, tooth_aes)
p

We’re getting close! There’s a grid with axes. To represent our data on the grid, we’re going to need

Geometry

Geometry describes the ways that data points can be visualized.

One variable

Some geometries only work with one of x or y passed to aes. Let’s use len.

p_len = ggplot(ToothGrowth, aes(x = len, fill = supp))

geom_histogram generates a histogram. If a categorical variable is assigned to the mapping (e.g. fill) it will be accounted for in the representation.

p_len + geom_histogram()
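By default, geom_histogram picks the number of bins for you and stacks the groups on top of each other. A minimal sketch of setting both explicitly follows; the bin width of 2 and the name p_hist are arbitrary choices for illustration, not part of the original example.

```r
library(ggplot2)

p_len <- ggplot(ToothGrowth, aes(x = len, fill = supp))

# binwidth = 2 is an arbitrary choice for illustration; tune it to your data.
# position = "identity" overlays the groups instead of stacking them,
# and alpha makes the overlapping bars visible.
p_hist <- p_len +
  geom_histogram(binwidth = 2, position = "identity", alpha = 0.5)
p_hist
```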

geom_density generates what is essentially a smoothed histogram. We’re going to want to set the alpha parameter of geom_density to a value less than 1 (say, .5) to lower the opacity of the filled curves so both distributions become visible.

p_len + geom_density(alpha = .5)

geom_dotplot generates a traditional dotplot, very similar to the histogram function.

p_len + geom_dotplot()
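As with the histogram, the dotplot's bin width can be set by hand. The sketch below also stacks the two supplement groups so their dots do not overlap; the bin width of 1 and the name p_dot are assumptions made for illustration.

```r
library(ggplot2)

p_len <- ggplot(ToothGrowth, aes(x = len, fill = supp))

# binwidth = 1 is illustrative. stackgroups = TRUE stacks dots across groups,
# and binpositions = "all" aligns the bins of both groups so the stacks line up.
p_dot <- p_len +
  geom_dotplot(binwidth = 1, stackgroups = TRUE, binpositions = "all")
p_dot
```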

Two variables

Those plots were great, but we’re going to need TWO variables to show off a few of the fancier geometries. They can be broken down by the number of continuous and discrete variables they represent.

Two continuous variables

First, let’s generate a base object with our continuous variables and sample groups.

p_2_c = ggplot(ToothGrowth, aes(x = dose, y = len, color = supp))

geom_point generates a basic scatter plot.

p_2_c + geom_point()

geom_jitter is very similar to geom_point, but it adds randomness to the locations of data points. This is useful in cases in which continuous data is measured at discrete values, like in this example.

p_2_c + geom_jitter()
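The amount of randomness can be controlled with the width and height arguments of geom_jitter. Here is a small sketch that spreads points horizontally only, so the measured lengths stay exact; the value 0.05 and the name p_jit are illustrative assumptions.

```r
library(ggplot2)

p_2_c <- ggplot(ToothGrowth, aes(x = dose, y = len, color = supp))

# width = 0.05 spreads points slightly around each dose level;
# height = 0 leaves the measured lengths untouched.
set.seed(1)  # jitter is random; fixing the seed makes the plot reproducible
p_jit <- p_2_c + geom_jitter(width = 0.05, height = 0)
p_jit
```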

geom_smooth fits a smoothed curve to the data (LOESS by default for a dataset of this size) and draws a confidence band around it.

p_2_c + geom_smooth()
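Because dose takes only three distinct values, the default smoother may emit warnings here. One alternative, sketched below, is to request a straight-line fit with method = "lm"; whether a linear model suits these data is an assumption made only for illustration.

```r
library(ggplot2)

p_2_c <- ggplot(ToothGrowth, aes(x = dose, y = len, color = supp))

# method = "lm" fits one straight line per supplement group;
# the shaded band is the confidence interval around each fit.
p_lm <- p_2_c +
  geom_jitter(width = 0.05, height = 0) +
  geom_smooth(method = "lm", formula = y ~ x)
p_lm
```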

These functions work well together!

p_2_c + geom_jitter() + geom_smooth()

This is a good time to note that you can combine any set of geometries provided that their mappings do not interfere.

Continuous and Discrete Variables

Let’s define two plots, one with group = dose and the other with group = supp.

p_dose = ggplot(ToothGrowth, aes(y = len, group = dose))
p_supp = ggplot(ToothGrowth, aes(y = len, group = supp))

Here, we use geom_boxplot to generate boxplots grouped by dose and supp.

p_dose + geom_boxplot()

p_supp + geom_boxplot()

Note that a boxplot generated with one continuous variable is also valid!
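A common variation on the grouped boxplot is to convert the grouping variable to a factor and map it to x, which gives each box its own labeled position on the axis. A minimal sketch (the name p_box is illustrative):

```r
library(ggplot2)

# factor(dose) turns the three dose levels into discrete categories,
# so each one gets its own labeled box on the x axis
p_box <- ggplot(ToothGrowth, aes(x = factor(dose), y = len)) +
  geom_boxplot()
p_box
```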

geom_violin produces a similar plot, but instead of summarizing the data by quartiles, it draws a kernel density estimate of the distribution as a continuous outline.

p_supp = ggplot(ToothGrowth, aes(y = len, x = supp, fill = supp))

p_supp + geom_violin()
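Violins and boxplots can be layered so that the quartiles are visible inside the density outline. This is a sketch of one way to do it; the box width of 0.1 and the white fill are arbitrary styling choices.

```r
library(ggplot2)

p_supp <- ggplot(ToothGrowth, aes(x = supp, y = len, fill = supp))

# The narrow boxplot marks the median and quartiles inside each violin
p_vb <- p_supp +
  geom_violin() +
  geom_boxplot(width = 0.1, fill = "white")
p_vb
```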

And of course, we can make a bar plot.

# geom_col plots y as given (it takes no stat argument), stacking rows that share a dose
ggplot(ToothGrowth, aes(y = len, x = dose)) + geom_col()
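If the goal is one bar per dose showing a summary value, one option is to compute the summary first and then plot it. The sketch below uses the mean, which is an assumption about what the bars should show; mean_len and p_bar are illustrative names.

```r
library(ggplot2)

# Compute the mean length at each dose with base R
mean_len <- aggregate(len ~ dose, data = ToothGrowth, FUN = mean)

# With one row per dose, geom_col draws each bar at exactly that mean
p_bar <- ggplot(mean_len, aes(x = factor(dose), y = len)) +
  geom_col()
p_bar
```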

Two Discrete Variables

The most common representation of data defined by two categorical variables is a heatmap! While we cover the base heatmap function in the class, you can also make one with ggplot2, albeit with a few extra steps.

We first use the table function to generate a frequency table for our variables. We then build a data frame with three columns: the first two list every combination of the two variables (generated with the expand.grid function), and the third is a flattened version of the frequency table, produced with the matrix function. R flattens a table column by column, and expand.grid varies its first argument fastest, so pass the rows to expand.grid before the columns to keep each frequency matched to its original categories. Finally, we pass this data frame to ggplot, set fill to the frequency column, and apply the geom_tile geometry.

tooth_freq = table(ToothGrowth[,c("len","dose")])
heatmap_df = data.frame(expand.grid(len = rownames(tooth_freq),
                                    dose = colnames(tooth_freq)),
                        freq = matrix(tooth_freq))
ggplot(heatmap_df, aes(x = dose, y = len, fill = freq)) + geom_tile()
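The same long-format data frame can also be produced in a single step, since calling as.data.frame on a table returns one row per combination of levels. A brief sketch of this shortcut follows; note that the count column is then named Freq, and heatmap_df2 and p_heat are illustrative names.

```r
library(ggplot2)

tooth_freq <- table(ToothGrowth[, c("len", "dose")])

# One row per (len, dose) combination, with the counts in a column named Freq
heatmap_df2 <- as.data.frame(tooth_freq)
head(heatmap_df2)

p_heat <- ggplot(heatmap_df2, aes(x = dose, y = len, fill = Freq)) +
  geom_tile()
p_heat
```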

Miscellaneous

Setting the frame size

There are often cases in which we want to limit the data shown along the axes of a graph. There are two common ways to subset the visible area. coord_cartesian subsets the visible area without modifying the underlying data.

ex_plot = ggplot(ToothGrowth, aes(x = len, y = dose, color = supp)) + geom_jitter() + geom_smooth()

ex_plot + coord_cartesian(xlim = c(10, 20))

Using the xlim and ylim functions directly instead subsets the data: points outside the limits are dropped before any statistics are computed, which changes the plot itself.

ex_plot + xlim(10, 20)

Note that the envelopes generated by these methods look pretty different! That’s because the data used to generate the plot matters, especially when features learned from the data are overlaid. With xlim, the smoothing calculation has a much more limited set of data to fit and is therefore less smooth than the original. So, pick your viewable area wisely!
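The difference can be checked directly by looking at the x values over which each smoother was computed. The sketch below swaps in a linear fit and drops the color grouping to keep the comparison simple; those simplifications, and all variable names, are assumptions for illustration.

```r
library(ggplot2)

base <- ggplot(ToothGrowth, aes(x = len, y = dose)) +
  geom_smooth(method = "lm", formula = y ~ x)

zoomed  <- base + coord_cartesian(xlim = c(10, 20))  # zooms the view only
clipped <- base + xlim(10, 20)                       # drops data outside limits

# layer_data() returns the data computed for a layer.
# xlim() warns about the rows it removes, so the warning is silenced here.
zoom_x <- layer_data(zoomed, 1)$x
clip_x <- suppressWarnings(layer_data(clipped, 1)$x)

range(zoom_x)  # spans every observed length
range(clip_x)  # stays within the limits passed to xlim
```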

Graphing functions

Sometimes (especially when attempting to fit a model to data) we want to superimpose a function over a plot. This is easy with stat_function, which is passed an anonymous function. Note that, if there isn’t any underlying data, we should specify the xlim.

ggplot() + stat_function(fun = function(x){x^2-2*x+1-10*sin(x)}) + xlim(-10,10)
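When the plot does have underlying data, stat_function draws over the x range already on the plot. As a sketch, the following overlays a straight line fitted with lm on the raw points; the linear model is chosen only for illustration, and fit, b, and p_fit are hypothetical names.

```r
library(ggplot2)

# Fit a straight line to the data; this simple model is only for illustration
fit <- lm(len ~ dose, data = ToothGrowth)
b <- unname(coef(fit))  # b[1] is the intercept, b[2] is the slope

# The data supply the x range, so no xlim is needed
p_fit <- ggplot(ToothGrowth, aes(x = dose, y = len)) +
  geom_point() +
  stat_function(fun = function(x) b[1] + b[2] * x)
p_fit
```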

Labels

We often want to change our title, subtitle, or axis labels from the automatically generated ones. Luckily, we can use the ggtitle, xlab, and ylab functions to change any of these features by passing our new text to the correct parameters.

p_2_c + geom_jitter() + ggtitle("Odontoblast length as a function of Vitamin C Dosage",
              subtitle = "As measured in 60 guinea pigs") +
              xlab("Vitamin C Dose") +
              ylab("Odontoblast length")
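The same labels can also be set in a single call to labs, which additionally accepts aesthetic names such as color to rename the legend title. A short sketch (the legend title "Supplement" and the name p_labs are illustrative choices):

```r
library(ggplot2)

p_labs <- ggplot(ToothGrowth, aes(x = dose, y = len, color = supp)) +
  geom_jitter() +
  labs(
    title = "Odontoblast length as a function of Vitamin C Dosage",
    subtitle = "As measured in 60 guinea pigs",
    x = "Vitamin C Dose",
    y = "Odontoblast length",
    color = "Supplement"  # renames the legend for the color aesthetic
  )
p_labs
```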

We can also move or remove the legend by passing one of “top”, “bottom”, “left”, “right”, or “none” to the legend.position parameter of the theme function.

p_2_c + geom_jitter() + ggtitle("Odontoblast length as a function of Vitamin C Dosage",
              subtitle = "As measured in 60 guinea pigs") +
              xlab("Vitamin C Dose") +
              ylab("Odontoblast length") +
  theme(legend.position = "top")

Themes

One of the coolest parts of ggplot2 is the number of freely available plot themes! There really are predesigned themes in any style that you would want, if you look hard enough. Here are a few found in the ggthemes package.

library(ggthemes)
base_plot = p_2_c + geom_jitter() + geom_smooth() + ggtitle("Odontoblast length as a function of Vitamin C Dosage",
              subtitle = "As measured in 60 guinea pigs") +
              xlab("Vitamin C Dose") +
              ylab("Odontoblast length")

base_plot + theme_fivethirtyeight()

base_plot + theme_excel_new()

base_plot + theme_solarized()
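ggplot2 itself also ships with several complete themes, so a different look is available without installing anything extra. A minimal sketch using two of them (the plot here is rebuilt from scratch so the example stands alone):

```r
library(ggplot2)

p_theme <- ggplot(ToothGrowth, aes(x = dose, y = len, color = supp)) +
  geom_jitter()

# theme_minimal() removes the grey background and panel border
p_min <- p_theme + theme_minimal()
p_min

# theme_bw() uses a white background with a dark panel border
p_bw <- p_theme + theme_bw()
p_bw
```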

Additional Resources

This tutorial is far from exhaustive, and you will likely need to learn a few additional features if you plot a lot. Included below are a few resources I refer to. I hope they can help you too!