4 Group, Facet, and Transform

Code almost never works properly the first time you write it. This is the main reason that, when learning a new language, it is important to type out the exercises and follow along manually. It gives you a much better sense of how the syntax of the language works, where you’re likely to make errors, and what the computer does when that happens. Running into bugs and errors is frustrating, but it’s also an opportunity to learn a bit more. Errors can be obscure but they are usually not malicious or random. If something has gone wrong, you can find out why it happened.

In R and ggplot, errors in code can result in figures that don’t look right. We have already seen the result of one of the most common problems, when an aesthetic is mistakenly set to a constant value instead of being mapped to a variable. In this chapter we will discuss some useful features of ggplot that also commonly cause trouble for new users. They have to do with how to tell ggplot more about the internal structure of your data (group), how to break up your data into pieces for a plot (facet), and how to get ggplot to perform some calculations on or summarize your data before producing the plot (transform). As before, we will proceed by repeated example, as this is how visualization usually works in practice.

4.1 Colorless Green Data Sleeps Furiously

When you’re working with ggplot you are in effect trying to “say” something visually. It usually takes several iterations to say exactly what you meant. This is more than a metaphor here. The software is an implementation of the “grammar” of graphics, an idea developed by Wilkinson (2005). The grammar is a set of rules for producing graphics from data, taking pieces of data and mapping them to geometric objects (like points and lines) that have aesthetic attributes (like position, color and size), together with further rules for transforming the data if needed (e.g. to a smoothed line), adjusting scales (e.g. to a log scale), and projecting the results onto a different coordinate system (usually Cartesian). We will see some alternatives to Cartesian coordinates later.

A key point is that, like other rules of syntax, the grammar limits the structure of what you can say, but it does not automatically make what you say sensible or meaningful. It allows you to produce “sentences”—mappings of data to objects together with rules for scaling and so on—but these can easily be garbled. Sometimes your code will not produce a plot at all because of some syntax error in R. You will forget a + sign between geom_ functions, or lose a parenthesis somewhere so that your function statement becomes unbalanced. In those cases R will complain that something has gone wrong. At other times, your code will successfully produce a plot, but its content will unexpectedly look garbled or wrong. In that case, the chances are you have given ggplot a series of grammatically correct instructions that are either nonsensical in some way, or fall short of what you meant to say. These problems often arise when ggplot does not have the information it needs in order to make your graphic say what you want it to say.

4.2 Grouped Data and the “group” Aesthetic

Let’s begin again with our Gapminder dataset. Imagine we wanted to plot the trajectory of per capita GDP over time for each country in the data. We map year to x and gdpPercap to y. We take a quick look at the documentation and discover that geom_line() will draw lines by connecting observations in order of the variable on the x-axis, which seems right. We write our code:

Figure 4.1: Trying to plot the data over time by country.

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                          y = gdpPercap))
p + geom_line() 

Something has gone wrong. What happened? While ggplot will make a pretty good guess as to the structure of the data, it does not know that the yearly observations in the data are grouped by country. We have to tell it. Because we have not, geom_line() gamely tries to join up all the observations in the order they appear in the dataset, as promised. It starts with the observation for 1952 in the first row of the data. It doesn’t know this belongs to Afghanistan. Instead of going to Afghanistan 1957, it finds there are a series of 1952 observations, so it joins all of those up first—all the way down to the 1952 observation that belongs to Zimbabwe. Then it moves to the first observation in the next year, 1957. (This would have worked if there were only one country in the dataset.)

The result is meaningless when plotted. Bizarre-looking output in ggplot is common enough, because everyone works out their plots one bit at a time, and making mistakes is just a feature of puzzling out how you want the plot to look. When ggplot successfully makes a plot but the result looks insane, the reason is almost always that something has gone wrong in the mapping between the data and aesthetics for the geom being used. This is so common there’s even a Twitter account devoted to the “Accidental aRt” that results. So don’t despair!

In this case, we can use the group aesthetic to tell ggplot explicitly about this country-level structure.

Figure 4.2: Plotting the data over time by country, again.

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                          y = gdpPercap))
p + geom_line(aes(group = country))

The plot here is still fairly rough, but it is showing the data properly, with each line representing the trajectory of a country over time. The gigantic outlier is Kuwait, in case you are interested.

The group aesthetic is usually only needed when the grouping information you need to tell ggplot about is not built in to the variables being mapped. For example, when we were plotting the points by continent, mapping color to continent was enough to get the right answer, because continent is already a categorical variable, so the grouping is clear. When mapping x to year, however, there is no information in the year variable itself to let ggplot know that it is grouped by country for the purposes of drawing lines with it. So we need to say so explicitly.
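
As a minimal sketch of the difference (not one of the numbered figures), we can combine the two: country supplies the grouping for the lines, while continent, being categorical, supplies the color with no extra grouping work needed.

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                          y = gdpPercap,
                          color = continent))
## group by country for the lines; the color grouping comes for free
p + geom_line(aes(group = country))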

4.3 Facet to make Small Multiples

The plot we just made has a lot of lines on it. While the overall trend is more or less clear, it looks a little messy. One option is to facet the data by some third variable, making a “small multiple” plot. This is a very powerful technique that allows a lot of information to be presented compactly, and in a consistently comparable way. A separate panel is drawn for each value of the faceting variable. Facets are not a geom, but rather a way of organizing a series of geoms. In this case we have the continent variable available to us. We will use facet_wrap() to split our plot by continent.

The facet_wrap() function can take a series of arguments, but the most important is the first one, which is specified using R’s “formula” syntax, which uses the tilde character, ~. The faceting formula is usually one-sided. Most of the time you will just want a single variable on the right side of the formula. But faceting is powerful enough to accommodate what are in effect the graphical equivalent of multi-way contingency tables, if your data is complex enough to require that. For now, we will just use a single term in our formula, which is the variable we want the data broken up by: facet_wrap(~ continent).

Figure 4.3: Faceting by continent.

p <- ggplot(data = gapminder,
            mapping = aes(x = year,
                          y = gdpPercap))
p + geom_line(aes(group = country)) + facet_wrap(~ continent)

Notice how the scales are efficiently laid out, and each facet is labeled at the top. Remember that we can still include other geoms as before, and they will be layered within each facet. We can also use the ncol argument to facet_wrap() to control the number of columns used to lay out the facets. Because we have only five continents it might be worth seeing if we can fit them on a single row (which means we’ll have five columns). In addition, we can add a smoother, and a few cosmetic enhancements that make the graph a little more effective. The result will be a plot that is a little more complex, but hopefully the additive nature of the plotting process will make it easier to understand.

p <- ggplot(data = gapminder, mapping = aes(x = year, y = gdpPercap))
p + geom_line(color = "gray70", aes(group = country)) +
    geom_smooth(size = 1.1, method = "loess", se = FALSE) +
    scale_y_log10(labels = scales::dollar) +
    facet_wrap(~ continent, ncol = 5) +
    labs(x = "Year",
         y = "GDP per capita",
         title = "GDP per capita on Five Continents")

Figure 4.4: Faceting by continent, again.

This plot brings together a basic aesthetic mapping of x and y variables, a grouping aesthetic (country), two geoms (a lineplot and a smoother), a log-transformed y-axis with appropriate tick labels, a faceting variable (continent), and finally axis labels and a title. (We could also have faceted by country, which would have made the group mapping superfluous. But that would make almost a hundred and fifty panels.)

The facet_wrap() function is best used when you want a series of small multiples based on a single categorical variable. Your panels will be laid out in order and then wrapped into a grid. If you wish you can specify the number of rows or the number of columns in the resulting layout. Facets can be more complex than this. For instance, you might want to cross-classify some data by two categorical variables. In that case you should try facet_grid() instead. This function will lay out your plot in a true two-dimensional arrangement, instead of a series of panels wrapped into a grid.

To see the difference, let’s briefly switch to the diamonds dataset that is included with the tidyverse. The diamonds dataset has information on more than fifty thousand diamonds, including continuous measures of each diamond’s carat weight and price, and categorical measures of its cut, color, and clarity. The price of a diamond depends in large part on its carat weight, but perhaps this relationship is affected in different ways by the combination of its cut and color. To examine this we will make a smoothed scatterplot of carat versus price, and create a grid of panels cross-classifying cut and color. We use the formula notation in the function to facet the data by cut and color. This time, the formula is two-sided: facet_grid(cut ~ color).

p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price))
p + geom_smooth(alpha = 0.3) + facet_grid(cut ~ color)

Figure 4.5: Faceting on two categorical variables. Each panel shows the relationship between carat and price, with the facets breaking out the data by cut (in the rows) and color (in the columns).

Further categorical variables can be added to the formula, too (e.g., cut ~ color + clarity), although the multiple dimensions of plots like this will get complicated very quickly if your variables have more than a few categories each. You might also investigate the difference between a faceting formula written as facet_grid(cut ~ color) versus one written as facet_grid(~ cut + color).

Multi-panel layouts of this kind are effective when used to summarize continuous variation (as in a scatterplot) across two or more categorical variables, with the categories (and hence the panels) ordered in some sensible way. Like facet_grid(), the facet_wrap() function can facet on two or more variables at once, as well. But it will do it by laying the results out in a wrapped one-dimensional table instead of a fully cross-classified grid. See the difference for yourself by replacing the call to facet_grid() with facet_wrap() in this plot.
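
Here is that comparison as a quick sketch. With facet_wrap(), each combination of cut and color becomes one level of a single classification, and the panels are wrapped into a grid rather than crossed in rows and columns.

p <- ggplot(data = diamonds, mapping = aes(x = carat, y = price))
## the same plot as above, but with the two variables wrapped, not crossed
p + geom_smooth(alpha = 0.3) + facet_wrap(~ cut + color)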

4.4 Geoms can Transform Data

We have already seen several examples where geom_smooth() was included as a way to add a trend line to the figure. Sometimes we plotted a LOESS line, sometimes a straight line from an OLS regression, and sometimes the result of a Generalized Additive Model. We did not have to write any code to specify these models, beyond telling the method argument in geom_smooth() which one we wanted to use. Thus, some geoms plot our data directly on the figure, as is the case with geom_point(), which takes variables designated as x and y and plots the points on a grid. But other geoms clearly do more work on the data before it gets plotted. Every geom_ function has an associated stat_ function that it uses by default. (Try p + stat_smooth(), for example.) The reverse is also the case: every stat_ function has an associated geom_ function that it will plot by default if you ask it to. This is not particularly important to know by itself, but—as we will see in the next section—we sometimes want to calculate a different statistic for the geom from the default.
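
As a minimal sketch of this pairing, calling a stat_ function directly draws its default geom. Assuming the gapminder mapping from earlier chapters, this produces the same smoothed line as p + geom_smooth():

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
## stat_smooth() draws its default geom, "smooth"
p + stat_smooth()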

Sometimes the calculations being done by the stat_ functions that work together with the geom_ functions might not be immediately obvious. For example, consider this figure produced by a new geom, geom_bar().

Figure 4.6: A bar chart.

p <- ggplot(data = gapminder,
            mapping = aes(x = continent))
p + geom_bar()

Here we specified just one mapping, aes(x = continent). The bar chart produced gives us a count of the number of (country-year) observations in the data set by continent. This seems sensible. But there is a y-axis variable here, count, that is not in the data. It has been calculated for us by the default stat_ function associated with geom_bar(), stat_count(). This function computes two new variables, count and prop (short for proportion). The count statistic is the one geom_bar() uses by default.

Figure 4.7: A first go at a bar chart with proportions.

p <- ggplot(data = gapminder,
            mapping = aes(x = continent))
p + geom_bar(mapping = aes(y = ..prop..))

If we want a chart of relative frequencies rather than counts, we will need to get the prop statistic instead. When ggplot calculates the count or the proportion, it returns temporary variables that we can use as mappings in our plots. As you can see from the code, the relevant statistic is called ..prop.. rather than prop. To make sure these temporary variables won’t be confused with others we are working with, their names begin and end with two periods. (We might already have a variable called count or prop in our dataset.) So our calls to it from the aes() function will generically look like this: <mapping> = <..statistic..>. In this case, we want y to use the calculated proportion, so we say aes(y = ..prop..).

The resulting plot is still not right. We no longer have a count on the y-axis, but the proportions of the bars all have a value of 1, so all the bars are the same height. We want them to sum to 1, so that we get the number of observations per continent as a proportion of the total number of observations. This is a grouping issue again. In a sense, it’s the reverse of the earlier grouping problem we faced when we needed to tell ggplot that our yearly data was grouped by country. In this case, we need to tell ggplot to ignore the x-categories when calculating the denominator of the proportion, and use the total number of observations instead. To do so we specify group = 1 inside the aes() call. The value of 1 is just a kind of “dummy group” that tells ggplot to use the whole dataset when establishing the denominator for its prop calculations.

Figure 4.8: A bar chart with correct proportions.

p <- ggplot(data = gapminder,
            mapping = aes(x = continent))
p + geom_bar(mapping = aes(y = ..prop.., group = 1)) 

To show some other features of geom_bar() and related plot types, let’s switch away from gapminder to a new dataset. The gapminder data consists mostly of continuous variables measured within countries by year. The only categorical grouping variable is the continent the countries are on. Very often in social scientific work—especially when it comes to the analysis of individual level survey data—we find ourselves dealing with categorical data of various kinds, whether ordered (e.g., levels of education) or unordered (e.g., ethnicity). Opinion questions may be asked in Yes or No terms, or on a five or seven point scale with a neutral value in the middle. Meanwhile, even many numeric measures—such as number of children, for instance—can only take integer values within a relatively narrow range, and thus may be treated in practice as ordered categorical variables running from zero to some top-coded value such as “Six or more”. Even properly continuous measures, such as income, are often only obtainable as ordered categories.

The socviz library contains a dataset called gss_sm. It is a small subset of the questions from the 2016 General Social Survey, or GSS. The GSS is a long-running survey of American adults that asks about a range of topics of interest to social scientists. (To begin with, we will use the GSS data in a relatively basic way. In particular, we will not consider sample weights when making the figures in this chapter. In Chapter 6 we will learn how to calculate frequencies and other statistics from data with a complex or weighted survey design.) The religion variable is derived from a question asking “What is your religious preference? Is it Protestant, Catholic, Jewish, some other religion, or no religion?” Recall that the $ character is one way of accessing individual columns within a data frame.

table(gss_sm$religion)

## 
## Protestant   Catholic     Jewish       None      Other 
##       1371        649         51        619        159

To graph this, we want a bar chart with religion on the x axis (as a categorical variable), and with the bars in the chart also colored by religion. If we want the bars filled with color, we should map the religion variable to fill. Remember, fill is for painting the inner areas of shapes. If we map religion to color, only the border lines of the bars will be assigned colors, and the insides will remain gray.

p <- ggplot(data = gss_sm,
            mapping = aes(x = religion, color = religion))
p + geom_bar()

p <- ggplot(data = gss_sm,
            mapping = aes(x = religion, fill = religion))
p + geom_bar() + guides(fill = FALSE) 

Figure 4.9: GSS Religious Preference mapped to color (left) and both color and fill (right).

By doing this, we have mapped two aesthetics to the same variable. Both x and fill are mapped to religion. There is nothing wrong with doing this. But strictly speaking, these are still two separate mappings, and so the default is to show a legend for the fill mapping. This legend is redundant, because the categories of religion are already separated out on the x-axis. In its simplest use, the guides() function controls whether guiding information about any particular mapping appears or not. If we set guides(fill = FALSE), the legend is removed, in effect saying that the viewer of the figure does not need to be shown any guiding information about this mapping. Setting the guide for some mapping to FALSE only has an effect if there is a legend to turn off to begin with. Trying x = FALSE or y = FALSE will have no effect, as these mappings have no additional guides or legends separate from their scales. It is possible to turn the x and y scales off altogether, but this is done through a different function (one in the scale_ family).
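
As an illustration of that scale_ approach (a sketch, with scale_x_discrete() as one plausible choice rather than code from the text), this removes the x-axis title, ticks, and labels rather than a legend:

p <- ggplot(data = gss_sm,
            mapping = aes(x = religion, fill = religion))
## scale_x_discrete(breaks = NULL) suppresses the x-axis breaks and labels;
## labs(x = NULL) removes the axis title (illustration only)
p + geom_bar() +
    guides(fill = FALSE) +
    scale_x_discrete(breaks = NULL) +
    labs(x = NULL)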

A more interesting use of the fill aesthetic with geom_bar() is to show two categorical variables at once. For example, let’s say we want to look at the bigregion variable broken down by religion. When we cross-classify categories in bar charts, there are several ways to display the results. With geom_bar() the output is controlled by the position argument. Let’s begin by mapping fill to religion.

Figure 4.10: A stacked bar chart of Religious Preference by Census Region.

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar()

The default output of geom_bar() is a stacked bar chart, with counts on the y-axis (and hence counts within the stacked segments of the bars also). Region of the country is on the x-axis, with counts of religious preference stacked within the bars. As we saw in Chapter 2, it is somewhat difficult for readers of the chart to compare lengths and areas on an unaligned scale. So while the relative positions of the bottom categories are quite clear (thanks to them all being aligned on the x-axis), the relative positions of, say, the “Catholic” category are harder to assess. An alternative choice is to set the position argument to "fill". (Note that this is different from the fill aesthetic.)

Figure 4.11: Using the fill position adjustment to show relative proportions across categories.

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "fill")

Now the bars are all the same height, which makes it easier to compare proportions across groups, although we lose the ability to see how the overall number of observations differs across regions. What if we wanted to show the proportion or percentage of religions within regions of the country, like in Figure 4.11, but instead of stacking the bars we wanted separate bars instead? As a first cut, we can use position = "dodge" to make the bars within each region of the country appear side by side. However, if we do it this way (try it), we will find that ggplot places the bars side-by-side as intended, but changes the y-axis back to a count of cases within each category rather than showing us a proportion. We saw in Figure 4.8 that to display a proportion we needed to map y = ..prop.., so the correct statistic would be calculated. Let’s see if that works.

Figure 4.12: A first go at a dodged bar chart with proportional bars.

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop..))

The result is certainly colorful, but it is still not what we wanted. Just as in Figure 4.7, there seems to be an issue with the grouping. When we just wanted the overall proportions for one variable, we mapped group = 1 to tell ggplot to calculate the proportions with respect to the overall N. In this case our grouping variable is religion, so we might try mapping that to the group aesthetic.

Figure 4.13: A second attempt at a dodged bar chart with proportional bars.

p <- ggplot(data = gss_sm,
            mapping = aes(x = bigregion, fill = religion))
p + geom_bar(position = "dodge",
             mapping = aes(y = ..prop.., group = religion)) 

This gives us a bar chart where the values of religion are broken down across regions, with a proportion showing on the y-axis. It is still not quite right, however. If you inspect the bars in Figure 4.13, you will see that they do not sum to one within each region. Instead, the bars for any particular religion sum to one across regions. By default in these cases, ggplot will show us the marginal distribution of religion across regions of the country. That is, the graph reflects the percentages shown in Table 4.1.

This lets us see that nearly half of those who said they were Protestant live in the South, for example. Meanwhile, just over ten percent of those saying they were Protestant live in the Northeast. Similarly, it shows that over half of those saying they were Jewish live in the Northeast, compared to about a quarter who live in the South. (Proportions for smaller sub-populations like this tend to bounce around from year to year in the GSS.) However, if we wanted a marginal comparison of this kind, we might have been better off just making a version of Figure 4.10 with the mappings flipped around: map religion to the x-axis and bigregion to the fill aesthetic. You can try this and see what it looks like; a sketch follows Table 4.1.

Table 4.1: Column marginals.

            Protestant  Catholic  Jewish  None  Other    NA
Northeast         11.5      25.0    52.9  18.1   17.6   5.6
Midwest           23.7      26.5     5.9  25.4   20.8  27.8
South             47.4      24.7    21.6  27.5   31.4  61.1
West              17.4      23.9    19.6  29.1   30.2   5.6
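
The flipped version suggested above, as a minimal sketch:

p <- ggplot(data = gss_sm,
            mapping = aes(x = religion, fill = bigregion))
## stacked counts of region within each religion
p + geom_bar()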

4.5 Pipe your Data to Summarize and Transform it

We are still not where we originally wanted to be. Our goal was to take the stacked bar chart in Figure 4.11 and have the proportions be shown side-by-side instead of on top of one another. In other words, we want the marginal percentages by row rather than by column. These are shown in Table 4.2. It is possible to write code to do this just within ggplot’s own functions. But in practice, code of this sort can become a little unwieldy. It is too easy to get confused about whether one has calculated row margins, column margins, or overall relative frequencies. The code to do the calculations on the fly ends up getting stuffed into the mapping function and can become hard to read. A better strategy is to calculate the frequency table you want first, and then plot that table. This has the benefit of allowing you to do some quick sanity checks on your output to make sure you haven’t made any errors.

Table 4.2: Row marginals.

            Protestant  Catholic  Jewish  None  Other   NA
Northeast         32.4      33.2     5.5  23.0    5.7  0.2
Midwest           46.8      24.7     0.4  22.6    4.7  0.7
South             61.8      15.2     1.0  16.2    4.8  1.0
West              37.7      24.5     1.6  28.5    7.6  0.2

We will take the opportunity to do a little bit of data-munging in order to get from our underlying table of GSS data to the summary tabulation that we want to plot. To do this we will use the tools provided by dplyr, a component of the tidyverse library that provides functions for manipulating and reshaping tables of data on the fly. We start from our individual-level gss_sm data frame with its bigregion and religion variables. Our goal is a summary table with percentages of religious preferences grouped within region.

Figure 4.14: How we want to transform the individual-level data.

As shown schematically in Figure 4.14, we will start with our individual-level table of 2,867 GSS respondents. Then we want to summarize them into a new table that shows a count of each religious preference, grouped by region. Finally we will turn these within-region counts into percentages, where the denominator is the total number of respondents within each region. The dplyr library provides a few tools to make this easy and clear to read. We will use a special operator, %>%, to do our work. This is the pipe operator. It plays the role of the yellow triangle in Figure 4.14, in that it helps us perform the actions that get us from one table to the next.

We have been building our plots in an additive fashion, starting with a basic ggplot object and layering on new elements. By analogy, think of the %>% operator as allowing us to start with a data frame and perform a sequence or pipeline of operations to turn it into another, usually smaller and more aggregated table. Data goes in one side of the pipe, actions are performed via functions, and results come out the other. A pipeline is typically a series of operations that do one or more of four things:

  • Group the data, with group_by(), into the nested structure we want for our summary, such as “Religion by Region” or “Authors by Publications by Year”.
  • Filter rows with filter(), or select columns with select(). This gets us the piece of the table we want to work on.
  • Mutate the data with mutate(), creating new variables at the current level of grouping. This adds new columns to the table.
  • Summarize or aggregate the grouped data with summarize(). This creates new variables at a higher level of grouping. For example we might calculate means with mean() or counts with n(). This results in a smaller, summary table, which we might do more things on if we want.

We use the dplyr functions group_by(), filter(), select(), mutate(), and summarize() to carry out these tasks within our pipeline. They are written in a way that allows them to be easily piped. That is, they understand how to take inputs from the left side of a pipe operator and pass results along through the right side of one. The dplyr documentation has some useful vignettes that introduce these grouping, filtering, selection, and transformation functions. There is also a more detailed discussion of these tools, along with many more examples, in Grolemund and Wickham (2016).

We will create a new table called rel_by_region. Here’s the code:

rel_by_region <- gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(N = n()) %>%
    mutate(freq = N / sum(N),
           pct = round((freq*100), 1))

To read these lines, first note that we are creating an object as usual, with the familiar assignment operator, <-. Then look at the steps to the right. Read the objects and functions from left to right, with the pipe operator “%>%” connecting them together meaning “and then …”. Objects on the left hand side “pass through” the pipe, and whatever is specified on the right of the pipe gets done to that object. The resulting object then passes through to the right again, and so on down to the end of the pipeline.

Reading from the left, then, the lines say this:

  • Create a new object, rel_by_region, with rel_by_region <- gss_sm %>%. Starting with the gss_sm data, it gets the result of the following steps:
  • Group the rows, with group_by(bigregion, religion) %>%, by bigregion and, within that, by religion.
  • Summarize this table, with summarize(N = n()) %>%, to create a new, much smaller table, with three columns: bigregion, religion, and a new summary variable, N, that is a count of the number of observations within each religious group for each region.
  • With this new table, use mutate(freq = N / sum(N), pct = round((freq*100), 1)) to take the N variable and calculate two new columns: the relative proportion (freq) and percentage (pct) for each religious category, still grouped by region. The percentage is rounded to one decimal place.

In this way of doing things, objects passed along the pipeline and the functions acting on them carry some assumptions about their context. For one thing, you don’t have to keep specifying the name of the underlying data frame object you are working from—everything is implicitly carried forward from gss_sm. Within the pipeline, the transient or implicit objects created from your summaries and other transformations are carried through, too.

Second, the group_by() function sets up how the grouped or nested data will be processed within the summarize() step. Any function used to create a new variable within summarize(), such as mean() or sd() or n(), will be applied to the innermost grouping level first. Grouping levels are named from left to right within group_by() from outermost to innermost. So the function call summarize(N = n()) counts up the number of observations for each value of religion within bigregion and puts them in a new variable named N. As dplyr’s functions see things, summarizing actions “peel off” one grouping level at a time, so that the resulting summaries are at the next level up. In this case, we start with individual-level observations and group them by religion within region. The summarize() operation aggregates the individual observations to counts of the number of people affiliated with each religion, for each region.

Third, the mutate() step takes the N variable and uses it to create freq, the relative frequency for each subgroup within region, and finally pct, the relative frequency turned into a rounded percentage. These mutate() operations add or remove columns from tables, but do not change the grouping level. Notice that inside both mutate() and summarize(), we are able to create new variables in a way that we have not seen before. Usually, when we see something like name = value inside a function (as in the case of aes(x = gdpPercap, y = lifeExp), for example), the name is a general, named argument and the function is expecting information from us about the specific value it should take. Normally if we give a function a named argument it doesn’t know about (aes(chuckles = year)) it will ignore it, complain, or break. With summarize() and mutate(), however, we can invent named arguments. We are still assigning specific values to N, freq, and pct, but we pick the names, too. They are the names that the newly-created variables in the summary table will have. The summarize() and mutate() functions do not need to know what they will be in advance.

Finally, note that when we use mutate() to create the freq variable, not only can we make up that name within the function, mutate() is also clever enough to let us use that name right away, on the next line of the same function call, when we create the pct variable. This means we do not have to repeatedly write separate mutate() calls for every new variable we want to create.

Our pipeline takes the gss_sm data frame, which has 2867 rows and 32 columns, and transforms it into rel_by_region, a summary table with 24 rows and 5 columns that looks like this, in part:

rel_by_region

## # A tibble: 24 x 5
## # Groups:   bigregion [4]
##    bigregion   religion     N       freq   pct
##       <fctr>     <fctr> <int>      <dbl> <dbl>
##  1 Northeast Protestant   158 0.32377049  32.4
##  2 Northeast   Catholic   162 0.33196721  33.2
##  3 Northeast     Jewish    27 0.05532787   5.5
##  4 Northeast       None   112 0.22950820  23.0
##  5 Northeast      Other    28 0.05737705   5.7
##  6 Northeast       <NA>     1 0.00204918   0.2
##  7   Midwest Protestant   325 0.46762590  46.8
##  8   Midwest   Catholic   172 0.24748201  24.7
##  9   Midwest     Jewish     3 0.00431655   0.4
## 10   Midwest       None   157 0.22589928  22.6
## # ... with 14 more rows

Notice that the variables specified in group_by() are retained in the new summary table; the variables created with summarize() and mutate() are added, and all the other variables in the original dataset are dropped.

We said before that, when trying to grasp what each additive step in a ggplot() sequence does, it can be helpful to work backwards, removing one piece at a time to see what the plot looks like when that step is not included. In the same way, when looking at pipelined code it can be helpful to start from the end of the line, removing one “%>%” step at a time to see what the resulting intermediate object looks like. For instance, what if we remove the mutate() step from the code above? What does rel_by_region look like then? What if we remove the summarize() step? How big is the table returned at each step? What level of grouping is it at? What variables have been added or removed?
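
For instance, here is the first of those checks as a sketch: the same pipeline with the mutate() step removed. The result is the grouped table of counts, before any proportions or percentages have been calculated.

## the pipeline from above, stopping before mutate()
gss_sm %>%
    group_by(bigregion, religion) %>%
    summarize(N = n())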

Plots that do not require sequential aggregation and transformation of the data before they are displayed are usually easy to write directly in ggplot, as the details of the layout are handled by a combination of mapping variables and layering geoms. One-step filtering or aggregation of the data (such as calculating a proportion, or a specific subset of observations) is also straightforward. But when the result we want to display is several steps removed from the data, and in particular when we want to group or aggregate a table and do some more calculations on the result before drawing anything, then it can make sense to use dplyr’s tools to produce these summary tables first—even if it would also be possible to do it within a ggplot() call. In addition to making our code easier to read, it lets us more easily perform sanity checks on our results, so that we are sure we have grouped and summarized things in the right order. For instance, if we have done things properly with rel_by_region, the pct values associated with religion should sum to 100 within each region, perhaps with a bit of rounding error. We can quickly check this using a very short pipeline, too:

rel_by_region %>% group_by(bigregion) %>%
    summarize(total = sum(pct))

## # A tibble: 4 x 2
##   bigregion total
##      <fctr> <dbl>
## 1 Northeast 100.0
## 2   Midwest  99.9
## 3     South 100.0
## 4      West 100.1

This looks good. We can now draw the bar chart we originally wanted. Because we are working directly with percentage values in a summary table, however, we no longer have any need for ggplot to count up values for us or perform any proportion calculations. In other words we do not need the services of any stat_ functions that geom_bar() would normally call. So we can tell geom_bar() not to do any work on the variable before plotting it. To do this we say stat = 'identity' in the geom_bar() call. We’ll also move the legend to the top of the chart.

p <- ggplot(rel_by_region, aes(x = bigregion, y = pct, fill = religion))
p + geom_bar(position = "dodge", stat = "identity") +
    labs(x = "Region", y = "Percent", fill = "Religion") +
    theme(legend.position = "top")

Figure 4.15: Religious preferences by Region.

The values in this chart are now equivalent to those in Figure 4.11, with each region summing to 100 percent.

We will see a little more of what dplyr’s grouping and filtering operations can do in the next chapter. For now, think of it as a way to quickly summarize tables of data without having to write too much code in the body of our ggplot() or geom_ functions.

4.6 Avoid Transformations when Necessary

As we have seen from the beginning, ggplot normally makes its charts starting from a full dataset. When we call geom_bar() it does its calculations on the fly using stat_count() behind the scenes to produce the counts or proportions it displays. In the previous section, we looked at a case where we wanted to group and aggregate our data ourselves before handing it off to ggplot. Often, the data we want to plot are in effect already a summary table. This can happen when we’ve computed a table of marginal frequencies or percentages from our original data already, as we just did in the previous section using our data pipeline. We are also in this situation when we are working with results from complex survey data (e.g., where we have taken sampling weights into account when calculating mean values or confidence intervals). As we shall see later on, plotting results from statistical models also puts us in this position.

The simplest case is when summary information is all that is available to us. For example, perhaps we do not have the individual-level data on who survived the Titanic disaster, but we do have a small table of counts of survivors by sex:

titanic

##       fate    sex    n percent
## 1 perished   male 1364    62.0
## 2 perished female  126     5.7
## 3 survived   male  367    16.7
## 4 survived female  344    15.6

Once again, if we want to make a bar chart of this data, no additional counting or other transformation is required. While we could type geom_bar(stat = "identity") as above, for convenience ggplot also provides geom_col(), which has exactly the same effect but is shorter.

Figure 4.16: Drowned and saved on the Titanic, by sex.

p <- ggplot(data = titanic,
            mapping = aes(x = fate, y = percent, fill = sex))
p + geom_col(position = "dodge") + theme(legend.position = "top")

The position argument in geom_bar() and geom_col() can also take the value of "identity". Just as stat = "identity" means “don’t do any summary calculations”, position = "identity" means “just plot the values as given”. This allows us to do things like, for example, plot a flow of positive and negative values in a bar chart. In the code below, we look at a small sample of monthly atmospheric CO2 concentrations measured at Mauna Loa. The data have a time variable (date), a CO2 measure (conc), a measure of each observation’s difference from the overall mean (diff), and a variable indicating whether the difference is positive or negative (pos). We plot the difference against time, color-coding each observation depending on whether it is above or below zero.

head(maunaloa)

##       conc       date  diff   pos
## 301 343.52 1984-01-01 -10.7 FALSE
## 302 344.33 1984-02-01  -9.9 FALSE
## 303 345.11 1984-03-01  -9.1 FALSE
## 304 346.88 1984-04-01  -7.3 FALSE
## 305 347.25 1984-05-01  -7.0 FALSE
## 306 346.62 1984-06-01  -7.6 FALSE

Figure 4.17: Using geom_col() to plot negative and positive values in a bar chart.

p <- ggplot(data = maunaloa,
            mapping = aes(x = date, y = diff, fill = pos))
p + geom_col() + guides(fill = FALSE)

As with the titanic plot, the default action of geom_col() is to set both stat and position to “identity”. To get the same effect with geom_bar() we would need to say geom_bar(stat = "identity", position = "identity"). Note also the guides(fill = FALSE) instruction at the end. As before, this tells ggplot to drop the unnecessary legend that would otherwise be automatically generated to accompany the fill mapping.
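
Here is that equivalence as a quick sketch, reusing the maunaloa mapping from above:

p <- ggplot(data = maunaloa,
            mapping = aes(x = date, y = diff, fill = pos))
## with stat and position both "identity", geom_bar() reproduces geom_col()
p + geom_bar(stat = "identity", position = "identity") +
    guides(fill = FALSE)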

4.7 Transformations in Histograms and Density Plots

We can see similar transformations at work when we want to summarize a continuous variable using a histogram. For example, ggplot comes with a dataset, midwest, containing county-level data for the midwestern United States. We can make a histogram of the county areas. Recall that because we are summarizing a continuous variable in a series of bars here, we need to divide the observations into groups, or “bins”, and count how many are in each interval. The geom_histogram() function will choose a bin size for us based on a rule of thumb.

p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with
## `binwidth`.

Figure 4.18: Histograms of the same variable, with different numbers of bins.

p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_histogram(bins = 10)

As with the bar charts, a newly-calculated variable, count, appears on the y-axis. The notification from R is telling us that the stat_bin() function picked 30 bins, but we might want to try something else. When drawing histograms it is worth experimenting with bins and also optionally the origin of the x-axis. Each, and especially bins, will make a big difference to how the resulting figure looks.

Frequency polygons, made with geom_freqpoly(), are closely related to histograms. Instead of displaying the count of observations using bars, they display it with a series of connected lines instead. You can try the various geom_histogram() calls in this section using geom_freqpoly() instead.
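
For example, a minimal sketch of the frequency-polygon version of the ten-bin histogram above:

p <- ggplot(data = midwest,
            mapping = aes(x = area))
## same binned counts as the histogram, drawn with connected lines
p + geom_freqpoly(bins = 10)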

While histograms summarize single variables, it’s also possible to use several at once to compare distributions. We can facet histograms by some variable of interest, or—as here—we can compare them in the same plot using the fill mapping.

Figure 4.19: Comparing two histograms.

OH.WI <- c("OH", "WI")

p <- ggplot(data = subset(midwest, subset = state %in% OH.WI),
            mapping = aes(x = percollege, fill = state))
p + geom_histogram(alpha = 0.4, bins = 20)

Note how we subset the data here to pick out just two states. We create a character vector with just two elements, “OH” and “WI”. Then we use the subset() function to take our data and filter it so that we only select rows whose state name is in this vector. The %in% operator is a convenient way to filter on more than one term in a variable when using subset().

When working with a continuous variable, an alternative to binning the data and making a histogram is to calculate a kernel density estimate of the underlying distribution. The geom_density() function will do this for us.

Figure 4.20: Kernel density estimate of county areas.

p <- ggplot(data = midwest,
            mapping = aes(x = area))
p + geom_density()

We can use color (for the lines) and fill (for the body of the density curve) here, too. These figures often look quite nice, although when there are several filled areas on the plot—as in this case—the overlap can become hard to read. If you want to make the baselines of the density curves go away, you can use geom_line(stat = "density") instead. This also removes the possibility of using the fill aesthetic. But this may be an improvement in some cases. Try it with the plot of county areas by state and see how they compare; a sketch follows the next figure.

Figure 4.21: Comparing distributions.

p <- ggplot(data = midwest,
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3)
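
The geom_line() version mentioned above, as a minimal sketch:

p <- ggplot(data = midwest,
            mapping = aes(x = area, color = state))
## density curves drawn as lines only: no baseline, no fill
p + geom_line(stat = "density")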

Just like geom_bar(), the count-based defaults computed by the stat_ functions used by geom_histogram() and geom_density() also return proportional measures if asked. For geom_density(), the stat_density() function can return its default ..density.. statistic, or ..scaled.., which will give a proportional density estimate scaled to a maximum of one. It can also return a statistic called ..count.., which is the density times the number of points. This can be used in stacked density plots.

Figure 4.22: Scaled densities.

p <- ggplot(data = subset(midwest, subset = state %in% OH.WI),
            mapping = aes(x = area, fill = state, color = state))
p + geom_density(alpha = 0.3, mapping = aes(y = ..scaled..))
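
And a sketch of the stacked density plot mentioned above, mapping y to the ..count.. statistic and stacking the two states:

p <- ggplot(data = subset(midwest, subset = state %in% OH.WI),
            mapping = aes(x = area, fill = state, color = state))
## ..count.. is the density times the number of points, so the
## per-state curves can be stacked
p + geom_density(alpha = 0.3, position = "stack",
                 mapping = aes(y = ..count..))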

At this point, we have a pretty good sense of the basic mechanics of visualizing our data. In fact, thanks to ggplot’s default settings, we have the ability to make good-looking and informative plots. Starting with a tidy dataset, we know how to map variables to aesthetics, to choose from a variety of geoms, layer them if necessary, and make some adjustments to the scales of the plot. We know more about selecting the right sort of computed statistic to show on the graph, if that’s what’s needed, and how to facet our core plot by one or more variables. We also know how to set descriptive labels for axes, and write a title, subtitle, and caption. From here we can expand our skills further by learning how to pick out and label data points on a plot, and how to order the data in our figures to make it easier to interpret.