5 Work with Geoms

Now that we know the core structure of ggplot’s approach to making figures, and are familiar with the elements of grouping, faceting, and transforming data, we can build on this foundation to make new kinds of plots. In effect this means learning new geom_ functions. We can also start to think more about refining how our plots look. This means learning more about supplementary functions that control the scales and legends in our plots, as well as functions that allow us to add arbitrary notes or shapes to plots.

Our approach will not change. No matter how complex our plots get, or how many individual steps we take to layer and tweak a plot’s features, underneath we will still be working with a table of data, a mapping of variables to aesthetics, and a type of graph. If you try not to lose sight of this, it will make it easier to wade in to the detail of getting any particular graph to look just how you want it to look.

In this Chapter, things will get a little more sophisticated in three main ways. First, we will be expanding the number of different geoms we know about, and learning more about how to choose between them. The more we learn about the available geoms, the better we will be able to select the right one given our data and our goals. Second, we will become a little more adventurous when it comes to departing from some of ggplot’s default arguments and settings. This means learning more about the arguments that can be supplied to geom_ functions, and also getting used to layering geoms on top of one another. Third, we will learn how to reorder variables displayed in our figures, and how to subset the data we use before we display it. These techniques can make plots much more legible to readers. They allow us to present data in a way that reflects relevant parts of its structure, and to pick out the elements of it that are of particular interest.

We will begin with a new dataset. Like the Gapminder data, it has a country-year structure. It contains a little more than a decade’s worth of information on organ procurement rates in seventeen OECD countries. This is a measure of the number of human organs obtained from cadaver organ donors for use in transplant operations. Along with the donation data, the dataset has a variety of numerical demographic measures, and several categorical measures of health and welfare policy and law. Unlike the gapminder data, some observations are missing. These are designated with a value of NA, R’s standard code for missing data. The organdata table is included in the socviz library. Load it up and take a quick look. Instead of using head(), this time we make a short pipeline to select the first six columns of the dataset and then pick ten rows at random using a function called sample_n(). It takes two main arguments: the table of data you want to sample from (because we are using a pipeline, this is implicit) and the number of draws you want to make. Using numbers this way in select() chooses the numbered columns of the data frame. You can also select variable names directly.

organdata %>% select(1:6) %>% sample_n(size = 10)

## # A tibble: 10 x 6
##           country       year donors   pop pop.dens   gdp
##             <chr>     <date>  <dbl> <int>    <dbl> <int>
##  1    Switzerland         NA     NA    NA       NA    NA
##  2    Switzerland 1997-01-01   14.3  7089    17.17 27675
##  3 United Kingdom 1997-01-01   13.4 58283    23.99 22442
##  4         Sweden         NA     NA  8559     1.90 18660
##  5        Ireland 2002-01-01   21.0  3932     5.60 32571
##  6        Germany 1998-01-01   13.4 82047    22.98 23283
##  7          Italy         NA     NA 56719    18.82 17430
##  8          Italy 2001-01-01   17.1 57894    19.21 25359
##  9         France 1998-01-01   16.5 58398    10.59 24044
## 10          Spain 1995-01-01   27.0 39223     7.75 15720
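As a minimal sketch of the two selection styles, the toy table below (built here for illustration from a few values in the printout, and not part of socviz) suggests that picking columns by position and by name should give the same result. It assumes the dplyr package is installed.

```r
library(dplyr)

# A small stand-in table; the object name `toy` is made up for this example
toy <- data.frame(
  country = c("Spain", "Italy", "France"),
  year    = c(1995, 2001, 1998),
  donors  = c(27.0, 17.1, 16.5),
  pop     = c(39223, 57894, 58398),
  stringsAsFactors = FALSE
)

# Select the first three columns by position ...
by_position <- toy %>% select(1:3)

# ... or select the same columns by name
by_name <- toy %>% select(country, year, donors)

# Compare the two results
identical(by_position, by_name)
```

Selecting by name is usually the more robust choice, because it keeps working if the column order of the table changes.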

Let’s start by naively graphing some of the data. We can take a look at a scatterplot of donors vs year.

Figure 5.1: Not very informative.

p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_point()

## Warning: Removed 34 rows containing missing values
## (geom_point).

Ggplot warns you about the missing values. We’ll suppress this warning from now on, so that it doesn’t clutter the output, but in general it’s wise to read and understand the warnings that R gives, even when code appears to run properly. If there are a large number of warnings, R will collect them all and invite you to view them with the warnings() function.
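One way to check a warning like this is to count the missing values yourself before plotting. The sketch below uses a made-up table (the name `toy` and its values are illustrative only) and base R functions to count missing values per column and the number of incomplete rows.

```r
# Made-up values standing in for two columns of a larger table
toy <- data.frame(
  year   = c(1995, 1996, NA, 1998, 1999),
  donors = c(27.0, NA, NA, 16.5, 13.4)
)

# Number of missing values in each column
colSums(is.na(toy))

# Number of rows with at least one missing value; rows like these
# cannot be drawn by a geom that needs both x and y
n_dropped <- sum(!complete.cases(toy))
n_dropped
```

If the count matches the number reported in the warning, you know the dropped rows are accounted for by missing data rather than by something unexpected.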

We could use geom_line() to plot each country’s time series, like we did with the gapminder data. To do that, remember, we need to tell ggplot what the grouping variable is. This time we can also facet the figure by country, as we do not have too many of them.

Figure 5.2: A faceted line plot.

p <- ggplot(data = organdata,
            mapping = aes(x = year, y = donors))
p + geom_line(aes(group = country)) + facet_wrap(~ country)

By default the facets are ordered alphabetically by country. We will see how to change this momentarily.

5.1 Continuous Variables by Group or Category

Let’s focus on the country-level variation, but without paying attention to the time trend. We can use geom_boxplot() to get a picture of variation by year across countries. Just as geom_bar() by default calculates a count of observations by the category you map to x, the stat_boxplot() function that works with geom_boxplot() will calculate a number of statistics that allow the box and whiskers to be drawn. We tell geom_boxplot() the variable we want to categorize by (here, country) and the continuous variable we want summarized (here, donors).

Figure 5.3: A first attempt at boxplots by country.

p <- ggplot(data = organdata,
            mapping = aes(x = country, y = donors))
p + geom_boxplot()

The boxplots look interesting but two issues could be addressed. First, it’s awkward to have the country names on the x-axis, because it makes them hard to read. (In some cases the labels overlap.) It’s possible to adjust the tick mark labels so that they’re printed at an angle, or vertically, but that isn’t so easy to read, either. It makes more sense to put the countries on the y-axis and the donor measure on the x-axis. Because of the way geom_boxplot() generates the boxes and whiskers internally, the obvious solution, swapping the x and y mapping, will not work in the version of ggplot used here. (Try it and see what happens. Releases of ggplot2 from 3.3.0 onward do accept the swapped mapping directly.) The way to solve this problem is to adjust the coordinate system that the results are plotted in, so that the x and y axes are flipped. We do this with coord_flip().

Figure 5.4: Moving the countries to the y-axis.

p <- ggplot(data = organdata,
            mapping = aes(x = country, y = donors))
p + geom_boxplot() + coord_flip()

That’s more legible but not ideal. The second issue is ordering: we generally want our plots to present data in some meaningful order. An obvious way is to have the countries listed from high to low average donation rate. We accomplish this by reordering the country variable by the mean of donors. The reorder() function will do this for us. It takes two required arguments. The first is the categorical variable or factor that we want to reorder. In this case, that’s country. The second is the variable we want to reorder it by. Here that is the donation rate, donors. The third and optional argument to reorder() is the function you want to use as a summary statistic. By default (that is, if you only give reorder() the first two required arguments) it will reorder the categories of your first variable by the mean value of the second. You can name any sensible function you like to reorder the categorical variable (e.g., median, or sd). There is one additional wrinkle. In R, the default mean function will return NA rather than a number if there are missing values in the variable you are trying to take the average of. You must say that it is OK to remove the missing values when calculating the mean. This is done by supplying the na.rm=TRUE argument to reorder(), which internally passes that argument on to mean(). We are reordering the variable we are mapping to the x aesthetic, so we use reorder() at that point in our code:

Figure 5.5: Boxplots reordered by mean donation rate.

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors))
p + geom_boxplot() +
    labs(x=NULL) +
    coord_flip()

Because it’s obvious what the country names are, in the labs() call we set their axis label to empty with labs(x=NULL). Ggplot offers some variants on the basic boxplot, including the violin plot. Try it with geom_violin(). There are also numerous arguments that control the finer details of the boxes and whiskers, including their width. Boxplots can also take color and fill aesthetic mappings like other geoms.

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, fill = world))
p + geom_boxplot() + labs(x=NULL) +
    coord_flip() + theme(legend.position = "top")

Figure 5.6: A boxplot with the fill aesthetic mapped.
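The reordering step used in the last few plots can be tried on its own, outside of ggplot. The sketch below uses made-up vectors (the values and the group names "A", "B", and "C" are purely illustrative) to show how missing values affect mean(), and how na.rm and an alternative summary function are passed through reorder().

```r
# Made-up data: three groups, with one missing value in group "A"
country <- c("A", "A", "B", "B", "C", "C")
donors  <- c(10, NA, 30, 32, 20, 22)

# mean() returns NA when any value is missing ...
mean(donors)
# ... unless we ask it to drop missing values first
mean(donors, na.rm = TRUE)

# Reorder by group mean, passing na.rm = TRUE through to mean()
by_mean <- reorder(country, donors, na.rm = TRUE)
levels(by_mean)

# Supply a different summary function as the third argument
by_median <- reorder(country, donors, FUN = median, na.rm = TRUE)
levels(by_median)
```

The levels of the result are sorted from the lowest to the highest summary value, which is the order in which ggplot will place the categories along the axis.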

Putting categorical variables on the y-axis to compare their distributions is a very useful trick. It makes it easy to effectively present summary data on more categories. The plots can be quite compact and fit a relatively large number of cases in by row. The approach also has the advantage of putting the variable being compared onto the x-axis, which sometimes makes it easier to compare across categories. If the number of observations within each category is relatively small, we can skip (or supplement) the boxplots and show the individual observations, too. Note that in this next example we map the world variable to color instead of fill as the default geom_point() plot shape has a color attribute, but not a fill.

Figure 5.7: Using points instead of a boxplot.

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, color = world))
p + geom_point() + labs(x=NULL) +
    coord_flip() + theme(legend.position = "top")

When we use geom_point() like this, there is some overplotting of observations. In these cases, it can be useful to perturb the data just a little bit in order to get a better sense of how many observations there are at different values. We use geom_jitter() to do this. This geom works much like geom_point(), but randomly nudges each observation by a small amount.

Figure 5.8: Jittering the points.

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, color = world))
p + geom_jitter() + labs(x=NULL) +
    coord_flip() + theme(legend.position = "top")

The default amount of jitter is a little too much for our purposes. We can control it using height and width arguments to a position_jitter() function within the geom. Because we’re making a one-dimensional summary here, we just need width. Can you see why we did not use height? If not, try it and see what happens.

p <- ggplot(data = organdata,
            mapping = aes(x = reorder(country, donors, na.rm=TRUE),
                          y = donors, color = world))
p + geom_jitter(position = position_jitter(width=0.15)) +
    labs(x=NULL) + coord_flip() + theme(legend.position = "top")

Figure 5.9: A jittered plot.

When we want to summarize a categorical variable that just has one point per category, we should use this approach as well. The result will be a Cleveland dotplot, a simple and extremely effective method of presenting data that is usually better than a bar chart. For example, we can make a Cleveland dotplot of the average donation rate.

This also gives us another opportunity to do a little bit of data munging with a dplyr pipeline. We will use one to aggregate our larger country-year data frame to a smaller table of summary statistics by country. Our goal is a table of average values of some variables in the original data, together with a measure of the standard deviation of the donation rate. We will again use the pipe operator, %>%, to do our work.

by_country <- organdata %>% group_by(consent.law, country) %>%
    summarize(don.rate = mean(donors, na.rm = TRUE),
              don.sd = sd(donors, na.rm = TRUE),
              gdp = mean(gdp, na.rm = TRUE),
              health = mean(health, na.rm = TRUE),
              roads = mean(roads, na.rm = TRUE),
              cerebvas = mean(cerebvas, na.rm = TRUE))

The pipeline consists of just two steps. First we group the data by consent.law and country, and then use summarize() to create six new variables, each one of which is the mean or standard deviation of each country’s score on a corresponding variable in the original organdata data frame. For an alternative grouping, change ‘country’ to ‘year’ in the grouping statement and see what happens.

As usual, the summarize() step will inherit information about the original data and the grouping, and then do its calculations at the innermost grouping level. In this case it takes all the observations for each country and calculates the mean or standard deviation as requested. Here is what the resulting object looks like:

by_country

## # A tibble: 17 x 8
## # Groups:   consent.law [?]
##    consent.law        country don.rate don.sd   gdp health roads cerebvas
##          <chr>          <chr>    <dbl>  <dbl> <dbl>  <dbl> <dbl>    <dbl>
##  1    Informed      Australia       11    1.1 22179   1958   105      558
##  2    Informed         Canada       14    0.8 23711   2272   109      422
##  3    Informed        Denmark       13    1.5 23722   2054   102      641
##  4    Informed        Germany       13    0.6 22163   2349   113      707
##  5    Informed        Ireland       20    2.5 20824   1480   118      705
##  6    Informed    Netherlands       14    1.6 23013   1993    76      585
##  7    Informed United Kingdom       13    0.8 21359   1561    68      708
##  8    Informed  United States       20    1.3 29212   3988   155      444
##  9    Presumed        Austria       24    2.4 23876   1875   150      769
## 10    Presumed        Belgium       22    1.9 22500   1958   155      594
## 11    Presumed        Finland       18    1.5 21019   1615    94      771
## 12    Presumed         France       17    1.6 22603   2160   156      433
## 13    Presumed          Italy       11    4.3 21554   1757   122      712
## 14    Presumed         Norway       15    1.1 26448   2217    70      662
## 15    Presumed          Spain       28    5.0 16933   1289   161      655
## 16    Presumed         Sweden       13    1.8 22415   1951    72      595
## 17    Presumed    Switzerland       14    1.7 27233   2776    96      424

Notice that, as before, the variables specified in group_by() are retained in the new data frame, the variables created with summarize() are added, and all the other variables in the original data are dropped. The countries are also summarized alphabetically within consent.law, which was the outermost grouping variable in the group_by() statement at the start of the pipeline. With our data summarized by country, we can draw a dotplot with geom_point(). Let’s also color the results by the consent law for each country.

Figure 5.10: A Cleveland dotplot, with colored points.

p <- ggplot(data = by_country,
            mapping = aes(x = don.rate,
                          y = reorder(country, don.rate),
                          color = consent.law))
p + geom_point(size=3) +
    labs(x="Donor Procurement Rate",
         y="",
         color="Consent Law") +
    theme(legend.position="top")

Alternatively, if we liked, we could use a facet instead of coloring the points. Using facet_wrap() we can split the consent.law variable into two panels, and then rank the countries by donation rate within each panel. Because we have a categorical variable on our y-axis, there are two wrinkles worth noting. The first is that, if we leave facet_wrap() to its defaults, the panels will be plotted side by side. This will make it difficult to compare the two groups on the same scale. Instead the plot will be read left to right, which is not useful. To avoid this, we will make sure the panels appear one on top of the other by specifying that we want our plot to only have one column. This is the ncol=1 argument. The second wrinkle is that, again because we have a categorical variable on the y-axis, the default facet plot will put lines for all the countries on the y-axis of both panels. (Were the y-axis a continuous variable this would be what we would want.) In that case, only half the rows in each panel of our plot will have points in them.

To avoid this we allow the y-axis scale to be free. This is the scales="free_y" argument. Again, for faceted plots where both variables are continuous, we generally do not want the scales to be free, because it allows the x- or y-axis for each panel to vary with the range of the data inside that panel only, instead of the range across the whole dataset. Because the point of small-multiple facets is to be able to compare across the panels, free scales are often not a good idea. But where the panels split by group as here, we just free the categorical axis and leave the continuous one fixed. The result is that each panel shares the same x-axis, and it is easy to compare between them.

Figure 5.11: A faceted dotplot with free scales on the y-axis.

p <- ggplot(data = by_country,
            mapping = aes(x = don.rate,
                          y = reorder(country, don.rate)))

p + geom_point(size=3) +
    facet_wrap(~ consent.law, scales = "free_y", ncol=1) +
    labs(x="Donor Procurement Rate",
         y="") 

Cleveland dotplots are generally preferred to bar or column charts. When making them, put the categories on the y-axis and order them in the way that is most relevant to the numerical summary you are providing. This sort of plot is also an excellent way to summarize model results or any data with error ranges. Note that we use geom_point() to draw our dotplots. There is a geom called geom_dotplot(), but it is designed to produce a different sort of figure: a kind of histogram, with individual observations represented by dots that are then stacked on top of one another to show how many of them there are.

The Cleveland-style dotplot can be extended to cases where we want to include some information about variance or error in the plot. Using geom_pointrange(), we can tell ggplot to show us a point estimate and a range around it. Here we will use the standard deviation of the donation rate that we calculated above. But this is also the natural way to present, for example, estimates of model coefficients with confidence intervals. With geom_pointrange() we map our x and y variables as usual, but the function needs a little more information than geom_point(). It needs to know the range of the line to draw on either side of the point, defined by the arguments ymax and ymin. This is given by the y value (don.rate) plus or minus its standard deviation (don.sd). If a function argument expects a number, it is OK to give it a mathematical expression that resolves to the number you want. R will calculate the result for you.

p <- ggplot(data = by_country,
            mapping = aes(x = reorder(country, don.rate), y = don.rate))

p + geom_pointrange(mapping = aes(ymin = don.rate - don.sd, ymax = don.rate + don.sd)) +
    labs(x="", y="Donor Procurement Rate") + coord_flip()

Figure 5.12: A dot-and-whisker plot, with the range defined by the standard deviation of the measured variable.

Note that because geom_pointrange() expects y, ymin, and ymax as arguments, we map don.rate to y and the country variable to x, then flip the axes at the end with coord_flip(). In addition to geom_pointrange() there is a family of related geoms that produce different kinds of error bars and ranges, depending on your specific needs. They include geom_linerange(), geom_crossbar(), and geom_errorbar().
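As a rough sketch of one of these related geoms, the example below draws points with capped error bars using geom_errorbar(). The summary table is made up for illustration (the names `toy`, `rate`, and `rate_sd` are not from organdata), and the code assumes ggplot2 is installed.

```r
library(ggplot2)

# Made-up summary table: one row per group, with a mean and a standard deviation
toy <- data.frame(
  group   = c("A", "B", "C"),
  rate    = c(11, 20, 28),
  rate_sd = c(1.1, 2.5, 5.0)
)

p <- ggplot(data = toy,
            mapping = aes(x = reorder(group, rate), y = rate))

# Points plus capped error bars spanning one standard deviation on either side.
# The ymin and ymax mappings are expressions computed from two columns.
p_err <- p + geom_point(size = 3) +
    geom_errorbar(mapping = aes(ymin = rate - rate_sd,
                                ymax = rate + rate_sd),
                  width = 0.2) +
    labs(x = "", y = "Rate") + coord_flip()

p_err
```

The width argument here controls the size of the caps at the ends of each bar, not the length of the range itself.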

5.2 Plot Text Directly

It can sometimes be useful to plot the labels along with the points in a scatterplot, or just plot informative labels directly. We can do this with geom_text().

Figure 5.13: Plotting labels and text.

p <- ggplot(data = by_country,
            mapping = aes(x = roads, y = don.rate))
p + geom_point() + geom_text(mapping = aes(label = country))

The text is plotted right on top of the points, because both are positioned using the same x and y mapping. One way of dealing with this, often the most effective if we are not too worried about excessive precision in the graph, is to remove the points by dropping geom_point() from the plot. A second option is to adjust the position of the text. We can left- or right-justify the labels using the hjust argument to geom_text(). Setting hjust=0 will left justify the label, and hjust=1 will right justify it.

Figure 5.14: Plot points and text labels, with a horizontal position adjustment.

p <- ggplot(data = by_country,
            mapping = aes(x = roads, y = don.rate))

p + geom_point() + geom_text(mapping = aes(label = country), hjust = 0)

You might be tempted to try different values for hjust to fine-tune your labels. But this is not a robust approach. It will often fail because the space is added in proportion to the length of the label. The result is that longer labels move further away from their points than you want. Instead, you can add a small constant to the label’s x position, like this:

Figure 5.15: Using hjust and adding a small constant to the x position.

p <- ggplot(data = by_country,
            mapping = aes(x = roads, y = don.rate))

p + geom_point() + geom_text(mapping = aes(x = roads + 1, label = country), hjust = 0)

Here again we’re using R’s recycling rules to our advantage. The aes(x = roads + 1, ...) statement in effect creates a new variable whose values are very close to those of roads, but just one larger. Because the x mapping in geom_point() is inherited from the p object, all of its values are just those of roads. But each of the labels is now positioned just a slight bit further to the right.
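The arithmetic behind roads + 1 can be seen with a plain vector. This small sketch uses the first four road-fatality values from the by_country printout; the object name `label_x` is made up for the example.

```r
# The first four roads values from the by_country table
roads <- c(105, 109, 102, 113)

# Adding a single number shifts every element by that amount,
# because the 1 is recycled along the length of the vector
label_x <- roads + 1
label_x
```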

Our plot still isn’t satisfactory, though, because too many of the label names overlap one another (and other points). In addition, several of our labels are clipped by the frame of the plot area and do not show fully. We could fix this manually in different ways, either by using shorter labels or setting the plot area to be larger. But we don’t have to. Instead, we will use ggrepel, a very useful library that adds some new geoms to ggplot. It provides geom_text_repel() and geom_label_repel(), two geoms that can pick out labels much more flexibly than the default geom_text(). First, make sure the library is installed and load it:

library("ggrepel")

Then use its main function, geom_text_repel(), instead of geom_text(). To demonstrate some of what geom_text_repel() can do, here is an example using some historical election data provided in the socviz library.

elections_historic %>% select(2:7) 

## # A tibble: 49 x 6
##     year                 winner win_party ec_pct popular_pct popular_margin
##    <int>                  <chr>     <chr>  <dbl>       <dbl>          <dbl>
##  1  1824      John Quincy Adams     D.-R.  0.322       0.309        -0.1044
##  2  1828         Andrew Jackson      Dem.  0.682       0.559         0.1225
##  3  1832         Andrew Jackson      Dem.  0.766       0.547         0.1781
##  4  1836       Martin Van Buren      Dem.  0.578       0.508         0.1420
##  5  1840 William Henry Harrison      Whig  0.796       0.529         0.0605
##  6  1844             James Polk      Dem.  0.618       0.495         0.0145
##  7  1848         Zachary Taylor      Whig  0.562       0.473         0.0479
##  8  1852        Franklin Pierce      Dem.  0.858       0.508         0.0695
##  9  1856         James Buchanan      Dem.  0.588       0.453         0.1220
## 10  1860        Abraham Lincoln      Rep.  0.594       0.397         0.1013
## # ... with 39 more rows

Figure 5.16: Text labels with ggrepel. Normally it is not a good idea to label every point on a plot in this way. A better approach would be to select a few points of particular interest.

Figure 5.16 takes each U.S. presidential election since 1824 (the first year that the size of the popular vote was recorded), and plots the winner’s share of the popular vote against the winner’s share of the electoral college vote. The shares are stored in the data as proportions (from 0 to 1) rather than percentages, so we need to adjust the labels of the scales using scale_x_continuous() and scale_y_continuous(). Seeing as we are interested in particular presidencies, we also want to label the points. But because many of the data points are plotted quite close together we need to make sure the labels do not overlap with each other, or obscure other points. The geom_text_repel() function handles the problem very well. This plot has relatively long labels. We could easily use them directly in the code, but just to keep things a little tidier we assign the text of the labels to some named objects instead, and then use those in the plot formula.

In this plot, what is of interest about any particular point is the quadrant of the x-y plane it is in, and how far away it is from the fifty percent threshold on both the x-axis (with the popular vote share) and the y-axis (with the electoral college vote share). To underscore this point we draw two reference lines at the fifty percent line in each direction. They are drawn at the beginning of the plotting process so that the points and labels can be layered on top of them. We use two new geoms, geom_hline() and geom_vline() to make the lines. They take yintercept and xintercept arguments, respectively, and the lines can also be sized and colored as you please. There is also a geom_abline() geom that draws straight lines based on a supplied slope and intercept. This is useful for plotting, for example, 45 degree reference lines in scatterplots.

p_title <- "Presidential Elections: Popular & Electoral College Margins"
p_subtitle <- "1824-2016"
p_caption <- "Data for 2016 are provisional."
x_label <- "Winner's share of Popular Vote"
y_label <- "Winner's share of Electoral College Votes"

p <- ggplot(elections_historic, aes(x = popular_pct,
                                    y = ec_pct,
                                    label = winner_label))

p + geom_hline(yintercept = 0.5, size = 1.4, color = "gray80") +
    geom_vline(xintercept = 0.5, size = 1.4, color = "gray80") +
    geom_point() +
    geom_text_repel() +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    labs(x = x_label,
         y = y_label,
         title = p_title,
         subtitle = p_subtitle,
         caption = p_caption)

The ggrepel package has several other useful geoms and options to aid with effectively plotting labels along with points. The performance of its labeling algorithm is consistently very good. For many purposes it will be a better first choice than geom_text().
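The geom_abline() reference line mentioned earlier in this section can be sketched in the same way as the horizontal and vertical lines. The data below are made up for illustration (the names `toy`, `before`, and `after` are hypothetical), and the code assumes ggplot2 is installed.

```r
library(ggplot2)

# Made-up paired measurements on a common scale
toy <- data.frame(
  before = c(0.42, 0.55, 0.61, 0.48, 0.70),
  after  = c(0.45, 0.51, 0.66, 0.60, 0.68)
)

p <- ggplot(data = toy, mapping = aes(x = before, y = after))

# Draw the reference line first so the points sit on top of it.
# slope = 1 and intercept = 0 give the line where after equals before.
p_ref <- p + geom_abline(slope = 1, intercept = 0, color = "gray80") +
    geom_point() +
    coord_equal()

p_ref
```

Using coord_equal() keeps one unit on the x-axis the same length as one unit on the y-axis, so that the line appears at a true 45 degree angle.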

5.3 Label Outliers

Sometimes we want to pick out some points of interest in the data without labeling every single item. We can still use geom_text() or geom_text_repel(). We just need to pick out the points we want to label. In the code below, we do this on the fly by telling geom_text_repel() to use a different data set from the one geom_point() is using. We do this using the subset() function.

Figure 5.17: Top: Labeling text according to a single criterion. Bottom: Labeling according to several criteria.

p <- ggplot(data = by_country,
            mapping = aes(x = gdp, y = health))

p + geom_point() +
    geom_text_repel(data = subset(by_country, gdp > 25000),
                    mapping = aes(label = country))

p <- ggplot(data = by_country,
            mapping = aes(x = gdp, y = health))

p + geom_point() +
    geom_text_repel(data = subset(by_country,
                     gdp > 25000 | health < 1500 | country %in% "Belgium"),
                    mapping = aes(label = country))

In the first figure, we specify a new data argument to the text geom, and use subset() to create a small dataset on the fly. The subset() function takes the by_country object and selects only the cases where gdp is over 25,000, with the result that only those points are labeled in the plot. The criteria we use can be whatever we like, as long as we can write a logical expression that defines it. For example, in the lower figure we pick out cases where gdp is greater than 25,000, or health is less than 1,500, or the country is Belgium. In all of these plots, because we are using geom_text_repel(), we no longer have to worry about our earlier problem where the country labels were clipped at the edge of the plot.
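To see what subset() returns before handing it to a geom, it can help to run it by itself. The sketch below copies six rows from the by_country printout into a small stand-in table (the name `toy` is made up) and applies the same two conditions used in the plots.

```r
# Six rows copied from the by_country printout
toy <- data.frame(
  country = c("Belgium", "Ireland", "Norway", "Spain", "Sweden",
              "United States"),
  gdp     = c(22500, 20824, 26448, 16933, 22415, 29212),
  health  = c(1958, 1480, 2217, 1289, 1951, 3988),
  stringsAsFactors = FALSE
)

# A single criterion
rich <- subset(toy, gdp > 25000)
rich$country

# Several criteria joined with | (logical OR): a row is kept
# if it satisfies at least one of them
picked <- subset(toy, gdp > 25000 | health < 1500 | country %in% "Belgium")
picked$country
```

In this stand-in table only Sweden fails all three conditions, so it is the one row left out of the second result.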

Alternatively, we can pick out specific points by creating a dummy variable in the data set just for this purpose. Here we add a column to organdata called ind. An observation gets coded as TRUE if ccode is “Ita”, or “Spa”, and if the year is greater than 1998. We use this new ind variable in two ways in the plotting code. First, we map it to the color aesthetic in the usual way. Second, we use it to subset the data that the text geom will label. Then we suppress the legend that would otherwise appear for the label and color aesthetics by using the guides() function.

Figure 5.18: Labeling using a dummy variable.

# year is stored as a Date, so compare it with a Date rather than a bare number
organdata$ind <- organdata$ccode %in% c("Ita", "Spa") &
                    organdata$year > as.Date("1998-01-01")

p <- ggplot(data = organdata,
            mapping = aes(x = roads,
                          y = donors, color = ind))
p + geom_point() +
    geom_text_repel(data = subset(organdata, ind),
                    mapping = aes(label = ccode)) +
    guides(label = "none", color = "none")
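When a year is stored as a Date, as the earlier printout of organdata suggests, some care is needed in writing a condition like "later than 1998". The sketch below uses made-up vectors to show how comparing a Date with a bare number behaves, and one possible way to build the indicator from the calendar year instead. The object names `yr` and `ind` are illustrative.

```r
# Made-up dates in the same form as the year column printed earlier
year <- as.Date(c("1995-01-01", "1999-01-01", "2002-01-01"))

# A Date is stored as a count of days, so comparing it with the bare
# number 1998 compares day counts, not calendar years
year > 1998

# Extract the calendar year as an integer before comparing
yr <- as.integer(format(year, "%Y"))
yr > 1998

# Combine with a membership test to build the indicator
ccode <- c("Spa", "Ita", "Fra")
ind <- ccode %in% c("Ita", "Spa") & yr > 1998
ind
```

Only the second element satisfies both conditions here: the first is too early and the third has a country code outside the chosen set.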

5.4 Write and Draw in the Plot Area

Sometimes we want to annotate the figure directly. Maybe we need to point out something important that is not mapped to a variable. We use annotate() for this purpose. It isn’t quite a geom, as it doesn’t accept any variable mappings from our data. Instead, it can use geoms, temporarily taking advantage of their features in order to place something on the plot. The most obvious use-case is putting arbitrary text on the plot.

We will tell annotate() to use a text geom. It hands the plotting duties to geom_text(), which means that we can use all of that geom’s arguments in the annotate() call. This includes the x, y, and label arguments, as one would expect, but also things like size, color, and the hjust and vjust settings that allow text to be justified. This is particularly useful when our label has several lines in it. We include extra lines by using the special “newline” code, \n, which we use instead of a space to force a line-break as needed.

Figure 5.19: Arbitrary text with annotate().

p <- ggplot(data = organdata, mapping = aes(x = roads, y = donors))
p + geom_point() + annotate(geom = "text", x = 91, y = 33,
                            label = "A surprisingly high\nrecovery rate.",
                            hjust = 0)

The annotate() function can work with other geoms, too. Use it to draw rectangles, line segments, and arrows. Just remember to pass along the right arguments to the geom you use. We can add a rectangle to this plot, for instance, with a second call to the function.

Figure 5.20: Using two different geoms with annotate().

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors))
p + geom_point() +
    annotate(geom = "rect", xmin = 125, xmax = 155,
             ymin = 30, ymax = 35, fill = "red", alpha = 0.2) +
    annotate(geom = "text", x = 157, y = 33,
             label = "A surprisingly high\nrecovery rate.", hjust = 0)
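The text above mentions line segments and arrows as well as rectangles. As a rough sketch of how that might look, the following passes segment arguments (x, xend, y, yend, and arrow) through annotate(). The coordinates and colors are illustrative guesses chosen for this example, not values taken from the text, and the sketch assumes the ggplot2 and socviz packages are installed.

```r
library(ggplot2)
library(socviz)   # provides the organdata table

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors))

# annotate() hands these arguments to geom_segment(); arrow() and unit()
# are re-exported by ggplot2. All positions here are assumptions made
# for illustration only.
p_arrow <- p + geom_point() +
    annotate(geom = "segment",
             x = 175, xend = 158, y = 28, yend = 32,
             arrow = arrow(length = unit(0.2, "cm")),
             color = "gray40") +
    annotate(geom = "text", x = 176, y = 27.5,
             label = "A surprisingly high\nrecovery rate.", hjust = 0)

p_arrow
```

As with the rectangle, nothing here is mapped from the data: each annotate() call adds one fixed mark, positioned by hand.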

5.5 Understanding Scales, Guides, and Themes

This chapter has gradually extended our ggplot vocabulary in two ways. First, we introduced some new geom_ functions that allowed us to draw new kinds of plots. Second, we made use of new functions controlling some aspects of the appearance of our graph. We used scale_x_log10(), scale_x_continuous() and other scale_ functions to adjust axis labels. We used the guides() function to remove the legends for a color mapping and a label mapping. And we also used the theme() function to move the position of a legend from the side to the top of a figure.

Learning about new geoms extended what we have seen already. Each geom makes a different type of plot. Different plots require different mappings in order to work, and so each geom_ function takes mappings tailored to the kind of graph it draws. You can’t use geom_point() to make a scatterplot without supplying an x and a y mapping, for example. Using geom_histogram() only requires you to supply an x mapping. Similarly, geom_pointrange() requires ymin and ymax mappings in order to know where to draw the lineranges it makes. Besides the mappings, a geom_ function will take other arguments suited to it as well. When using geom_boxplot() you can specify what the outliers look like using arguments like outlier.shape and outlier.color, or you may use varwidth to make the widths of the boxes proportional to the square root of the number of observations.
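To make the point about geom-specific arguments concrete, here is a minimal sketch of the geom_boxplot() options just mentioned. The particular shape and color values are arbitrary choices for illustration, and the sketch assumes ggplot2 and socviz are installed.

```r
library(ggplot2)
library(socviz)   # provides the organdata table

p <- ggplot(data = organdata,
            mapping = aes(x = country, y = donors))

# outlier.shape and outlier.color control how outlying points are drawn;
# varwidth = TRUE scales each box's width by the square root of its
# group size. na.rm = TRUE drops missing donor values without a warning.
p_box <- p + geom_boxplot(outlier.shape = 1,
                          outlier.color = "firebrick",
                          varwidth = TRUE,
                          na.rm = TRUE) +
    coord_flip()

p_box
```

None of these arguments would mean anything to geom_point() or geom_histogram(); they belong to the boxplot geom alone.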

The second kind of extension introduced some new functions. With them came some new concepts. What are the differences between the scale_ functions, the guides() function, and the theme() function? When do you know to use one rather than the other? Why are there so many scale_ functions listed in the online help, anyway? How can you tell which one you need?

Rather than provide exhaustive definitions, here is a rough and ready starting point, just to get oriented:

  • Every aesthetic mapping has a scale. If you want to adjust how that scale is marked or graduated, then you use a scale_ function.
  • Many scales come with a legend or key to help the reader interpret the graph. These are called guides. You can make adjustments to them with the guides() function. Perhaps the most common use case is to make the legend disappear, as it is sometimes superfluous. Another is to adjust the arrangement of the key in legends and colorbars.
  • Graphs have many other features that are not strictly connected to the logical structure of the data being displayed. These include things like their background color, the typeface used for labels, or the placement of the legend on the graph. To adjust these, use the theme() function.
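One way to keep the three roles apart is to see them side by side on a single plot. The sketch below is only an illustration: the break values, the two-row legend layout, and the legend position are arbitrary choices, and it assumes ggplot2 and socviz are installed.

```r
library(ggplot2)
library(socviz)   # provides the organdata table

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors, color = world))

p_all <- p + geom_point() +
    # scale_: change how the x scale is marked
    scale_x_continuous(breaks = c(50, 100, 150, 200)) +
    # guides(): change how the legend for the color scale is arranged
    guides(color = guide_legend(nrow = 2)) +
    # theme(): a cosmetic change unrelated to the data
    theme(legend.position = "bottom")

p_all
```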

Consistent with ggplot’s overall approach, adjusting some visible feature of the graph means first thinking about the relationship that feature has with the underlying data. Roughly speaking, if the change you want to make will affect the interpretation of any particular geom, then most likely you will either be mapping an aesthetic to a variable using that geom’s aes() function, or you will be specifying a change via some scale_ function. If the change you want to make does not affect the interpretation of a given geom, then most likely you will either be setting an aesthetic to a fixed value inside the geom_ function or making a cosmetic change via the theme() function.
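The distinction between mapping and setting can be shown with two versions of the same scatterplot. This is a minimal sketch, assuming ggplot2 and socviz are installed; the color "purple" is an arbitrary choice.

```r
library(ggplot2)
library(socviz)   # provides the organdata table

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors))

# Mapping: color is inside aes(), so it represents the world variable
# and gets a scale and a legend.
p_mapped <- p + geom_point(mapping = aes(color = world))

# Setting: color is outside aes(), so every point is simply purple.
# No scale or legend is created, because no variable is involved.
p_set <- p + geom_point(color = "purple")

p_mapped
p_set
```

In the first plot the color changes how the points should be read; in the second it is purely a matter of appearance.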

These functions can be confusing because scales and guides are closely connected. The guide provides information about the scale, such as in a legend or colorbar. Thus, it is possible to make adjustments to guides from inside the various scale_ functions if you like. But most often it is easier to use the guides() function.
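As a small sketch of that overlap, the two plots below should look the same: one removes the color legend from inside a scale_ function via its guide argument, the other does it with guides(). This assumes ggplot2 and socviz are installed.

```r
library(ggplot2)
library(socviz)   # provides the organdata table

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors, color = world))

# Route 1: turn off the guide from inside the scale function.
p_via_scale <- p + geom_point() +
    scale_color_discrete(guide = "none")

# Route 2: turn off the guide with guides().
p_via_guides <- p + geom_point() +
    guides(color = "none")

p_via_scale
p_via_guides
```

The second form is usually the more convenient one, since it does not require knowing which specific scale_ function applies to the mapping.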

Figure 5.21: Every mapped variable has a scale.

p <- ggplot(data = organdata,
            mapping = aes(x = roads,
                          y = donors,
                          color = world))
p + geom_point()

Figure 5.21 shows a plot with three aesthetic mappings. The variable roads is mapped to x; donors is mapped to y; and world is mapped to color. The x and y scales are both continuous, running smoothly from just under the lowest value of the variable to just over the highest value. Various labeled tick marks orient the reader to the values on each axis. The color mapping also has a scale. The world measure is an unordered categorical variable, so its scale is discrete. It takes one of four values, each represented by a different color.

Along with color, other mappings like fill, shape, and size will have scales that we might want to customize or adjust. We could, for example, have mapped world to shape instead of color. In that case our four-category variable would have a scale consisting of four different shapes. If we want to make adjustments to the scales for these mappings (features like their labels, the position of axis tick marks, or the particular colors or shapes used on the scale), then we use one of the scale_ family of functions.
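Here is a sketch of the shape version just described, with the shapes chosen by hand through a scale_ function. The specific shape codes are arbitrary choices for illustration, and the sketch assumes ggplot2 and socviz are installed.

```r
library(ggplot2)
library(socviz)   # provides the organdata table

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors, shape = world))

# values supplies one shape per non-missing category of world, in the
# order the scale lists them. na.value gives a shape to observations
# where world is missing. The codes 16, 17, 15, and 4 are illustrative.
p_shape <- p + geom_point() +
    scale_shape_manual(values = c(16, 17, 15),
                       na.value = 4)

p_shape
```

The function name follows from the mapping being adjusted: the aesthetic is shape, so the relevant family is scale_shape_.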

Mappings can represent many different kinds of variable. Most often, x and y are continuous measures. But they might also easily be discrete, as when we mapped country names to the y axis in our boxplots and dotplots. As we have also seen, an x or y mapping can be defined as a transformation onto a log scale, or as a special sort of number value like a date. Similarly, a color or fill mapping can be discrete and unordered, as with our world variable, or discrete and ordered, as with letter grades in an exam. A color or fill mapping can also be a continuous quantity represented as a gradient running smoothly from a high to a low value. Finally, both continuous gradients and ordered discrete values might also have a defined neutral midpoint with extremes diverging in both directions.
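The continuous and diverging cases can be sketched by mapping a numeric variable to color. The example below uses the gdp column of organdata and, purely for illustration, treats its median as the "neutral" midpoint; that choice and the colors are assumptions of this sketch, not recommendations from the text. It assumes ggplot2 and socviz are installed.

```r
library(ggplot2)
library(socviz)   # provides the organdata table

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors, color = gdp))

# A continuous gradient running between two colors.
p_grad <- p + geom_point() +
    scale_color_gradient(low = "gray80", high = "navy")

# A diverging gradient: two extremes on either side of a midpoint.
# Using the median of gdp as the midpoint is an illustrative assumption.
mid_gdp <- median(organdata$gdp, na.rm = TRUE)
p_div <- p + geom_point() +
    scale_color_gradient2(low = "firebrick", mid = "gray90",
                          high = "navy", midpoint = mid_gdp)

p_grad
p_div
```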

Figure 5.22: A template for the scale functions.

There are many possibilities. Because we have a variety of mappings, and because each mapping might be to one of several different scales, we end up with a lot of individual scale_ functions. Each deals with one combination of mapping and scale. They are named according to a consistent logic, shown in Figure 5.22. First comes the scale_ name, then the mapping it applies to, and finally the kind of value the scale will display. Thus, the scale_x_continuous() function controls x scales for continuous variables; scale_y_discrete() adjusts y scales for discrete variables; and scale_x_log10() transforms an x mapping to a log scale. Most of the time, ggplot will guess correctly what sort of scale is needed for your mapping. Then it will work out some default features of the scale (such as its labels and where the tick marks go). In many cases you will not need to make any scale adjustments. If x is mapped to a continuous variable then adding + scale_x_continuous() to your plot statement with no further arguments will have no effect. It is already there implicitly. Adding + scale_x_log10(), on the other hand, will transform your scale, as now you have replaced the default treatment of a continuous x variable.
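The claim about implicit scales can be checked with three versions of the same plot. This is a minimal sketch, assuming ggplot2 and socviz are installed; the object names are made up for the example.

```r
library(ggplot2)
library(socviz)   # provides the organdata table

p <- ggplot(data = organdata,
            mapping = aes(x = roads, y = donors))

# The default: ggplot chooses a continuous x scale on its own.
p_default <- p + geom_point()

# Making the default explicit, with no further arguments.
# This should draw the same plot as p_default.
p_explicit <- p + geom_point() + scale_x_continuous()

# Replacing the default treatment of x with a log10 transformation.
p_log <- p + geom_point() + scale_x_log10()

p_default
p_explicit
p_log
```

Each function name reads as scale, then the mapping (x), then the kind of scale (continuous or log10).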

If you want to adjust the labels or tick marks on a scale, you will need to know which mapping it is for and what sort of scale it is. Then you supply the arguments to the appropriate scale function. For example, we can change the x-axis of the previous plot to a log scale, and then also change the position and labels of the tick marks on the y-axis.

Figure 5.23: Making some scale adjustments.

p <- ggplot(data = organdata,
            mapping = aes(x = roads,
                          y = donors,
                          color = world))
p + geom_point() +
    scale_x_log10() +
    scale_y_continuous(breaks = c(5, 15, 25),
                       labels = c("Five", "Fifteen", "Twenty Five"))

The same applies to mappings like color and fill. Here the available scale_ functions include ones that deal with continuous, diverging, and discrete variables, as well as others that we will encounter later when we discuss the use of color and color palettes in more detail. When working with a scale that produces a legend, we can also use its scale_ function to specify the labels in the key. To change the title of the legend, however, we use the labs() function, which lets us label all the mappings.

Figure 5.24: Relabeling via a scale function.

p <- ggplot(data = organdata,
            mapping = aes(x = roads,
                          y = donors,
                          color = world))
p + geom_point() +
    scale_color_discrete(labels =
                             c("Corporatist", "Liberal",
                               "Social Democratic", "Unclassified")) +
    labs(x = "Road Deaths",
         y = "Donor Procurement",
         color = "Welfare State")

If we want to move the legend somewhere else on the plot, we are making a purely cosmetic decision and that is the job of the theme() function. As we have already seen, adding + theme(legend.position = "top") will move the legend as instructed. Finally, to make the legend disappear altogether, we tell ggplot that we do not want a guide for that scale. This is generally not good practice, but there can be good reasons to do it. We will see some examples later on.

Figure 5.25: Removing the guide to a scale.

p <- ggplot(data = organdata,
            mapping = aes(x = roads,
                          y = donors,
                          color = world))
p + geom_point() +
    labs(x = "Road Deaths",
         y = "Donor Procurement") +
    guides(color = "none")

We will look more closely at scale_ and theme() functions in Chapter 8, when we discuss how to polish plots that we are ready to display or publish. Until then, we will use scale_ functions fairly regularly to make small adjustments to the labels and axes of our graphs. And we will occasionally use the theme() function to make some cosmetic adjustments here and there. So you do not need to worry too much about additional details of how they work until later on. But at this point it is worth knowing what scale_ functions are for, and the logic behind their naming scheme. Understanding the scale_<mapping>_<kind>() rule makes it easier to see what is going on when one of these functions is called to make an adjustment to a plot.