3 Make a Plot

This Chapter will teach you how to use ggplot’s core functions to produce a series of scatterplots. From one point of view, we will proceed slowly and carefully, taking our time to understand the logic behind the commands that you type. The reason for this is that the central activity of visualizing data with ggplot more or less always involves the same sequence of steps. So it is worth learning what they are.

From another point of view, though, we will move fast. Once you have the basic sequence down, and understand how ggplot assembles the pieces of a plot into a final image, you will find that analytically and aesthetically sophisticated plots come within your reach very quickly. By the end of this Chapter, for example, we will have learned how to produce a small-multiple plot of time series data for a large number of countries, with a smoothed regression line in each panel.

3.1 How ggplot Works

As we saw in Chapter 2, visualization involves representing your data using lines or shapes or colors and so on. There is some structured relationship, some mapping, between the variables in your data and their representation in the plot displayed on your screen or on the page. We also saw that not all mappings make sense for all types of variables, and (independently of this) some representations are harder to interpret than others. Ggplot provides you with a set of tools to map data to visual elements on your plot, to specify the kind of plot you want, and then subsequently to control the fine details of how it will be displayed. Figure 3.1 shows a schematic outline of the process starting from data, at the top, down to a finished plot at the bottom. Don’t worry about the details for now. We will be going into them one piece at a time over the next few chapters.

Figure 3.1: The main elements of ggplot’s grammar of graphics.

The most important thing to get used to with ggplot is the way you use it to think about the logical structure of your plot. The code you write says what the connections are between the variables in your data, and the plot elements you see on the screen—things like points, colors, and shapes. In ggplot, these logical connections between your data and the plot elements are called aesthetic mappings or just aesthetics. You begin every plot by telling the ggplot() function what your data is, and then how the variables in this data logically map onto the plot’s aesthetics. Then you take the result and say what general sort of plot you want, such as a scatterplot, a boxplot, or a bar chart. In ggplot, the overall type of plot is called a geom. Each geom has a function that creates it. For example, geom_point() makes scatterplots, geom_bar() makes barplots, geom_boxplot() makes boxplots, and so on. You combine these two pieces, the ggplot() object and the geom, by literally adding them together in an expression, using the “+” symbol.

At this point, ggplot will have enough information to be able to draw a plot for you. The rest is just details about exactly what you want to see. If you don’t specify anything further, ggplot will use a set of defaults that try to be sensible about what gets drawn. But more often, you will want to specify exactly what you want, including information about the scales, the labels of legends and axes, and other guides that help people to read the plot. These additional pieces are added to the plot in the same way as the geom_ function was. Each component has its own function, you provide arguments to it specifying what to do, and you literally add it to the sequence of instructions. In this way you systematically build your plot piece by piece.

In this chapter we will go through the main steps of this process. We will proceed by example, repeatedly building a series of plots. As noted earlier, I strongly encourage you to go through this exercise manually, typing (rather than copying-and-pasting) the code yourself. This may seem a bit tedious, but it is by far the most effective way to get used to what is happening, and to get a feel for R’s syntax. While you’ll inevitably make some errors, you will also quickly find yourself becoming able to diagnose your own errors, as well as having a better grasp of the higher-level structure of plots. You should open the RMarkdown file for your notes, remember to load the tidyverse library with library(tidyverse), and write the code out in chunks, interspersing your own notes and comments as you go.

Table 3.1: Life Expectancy data in wide format.

country 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Afghanistan 29 30 32 34 36 38 40 41 42 42 42 44
Albania 55 59 65 66 68 69 70 72 72 73 76 76
Algeria 43 46 48 51 55 58 61 66 68 69 71 72
Angola 30 32 34 36 38 39 40 40 41 41 41 43
Argentina 62 64 65 66 67 68 70 71 72 73 74 75
Australia 69 70 71 71 72 73 75 76 78 79 80 81

3.2 Tidy Data

The tidyverse tools we will be using want to see your data in a particular sort of shape, generally referred to as “tidy data” (Wickham 2014). Social scientists will likely be familiar with the distinction between wide format and long format data. In a long format table, every variable is a column, and every observation is a row. In a wide format table, some variables are spread out across columns. For example, Table 3.1 shows part of a table of life expectancy over time for a series of countries. This is in wide format, because one of the variables—year—is spread across the columns of the table.

By contrast, Table 3.2 shows the beginning of the same data in long format. The tidy data that ggplot wants is in this long form. In a related bit of terminology, in this table the year variable is sometimes called a key and the lifeExp variable is the value taken by that key for any particular row. These terms are useful when converting tables from wide to long format. I am speaking fairly loosely here. Underneath these terms there is a worked-out theory of the forms that tabular data can be stored in, but right now we don’t need to know those additional details. For more discussion on the ideas behind tidy data, see the discussion in the Appendix, and the references provided there.

Table 3.2: Life Expectancy data in long format.

country year lifeExp
Afghanistan 1952 29
Afghanistan 1957 30
Afghanistan 1962 32
Afghanistan 1967 34
Afghanistan 1972 36
Afghanistan 1977 38

As you can see from a comparison of Tables 3.1 and 3.2, a tidy table does not present data in its most compact form. In fact, it is usually not how you would choose to present your data if you wanted to just show people the numbers. Neither is untidy data “messy” or the “wrong” way to lay out data in some generic sense. It’s just that, even if its long-form shape makes tables larger, tidy data is much more straightforward to work with when it comes to specifying the mappings that you need to coherently describe plots.
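As a rough sketch of the wide-to-long conversion described above, the following uses pivot_longer() from the tidyr package on a small excerpt of Table 3.1. The object names life_wide and life_long are made up for illustration, and this is only one of several ways to reshape a table.

```r
library(tidyr)

# A small excerpt of Table 3.1, typed in by hand (wide format).
# check.names = FALSE keeps the year column names as "1952", "1957", ...
life_wide <- data.frame(
  country = c("Afghanistan", "Albania"),
  `1952` = c(29, 55),
  `1957` = c(30, 59),
  `1962` = c(32, 65),
  check.names = FALSE
)

# Gather every column except country into key/value pairs:
# the old column names go into `year`, the cell values into `lifeExp`.
life_long <- pivot_longer(
  life_wide,
  cols = -country,
  names_to = "year",
  values_to = "lifeExp",
  names_transform = list(year = as.integer)
)

life_long
```

Each row of life_long now holds one country-year observation, which is the shape shown in Table 3.2.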

3.4 Build Your Plots Layer by Layer

Although we got a brief taste of ggplot in Chapter 1, we then spent some time preparing the ground before we made this first proper plot. We set up our software IDE and made sure we could reproduce our work. We got some sense of what people see when they look at plots. We learned the basics of how R works. And we went through the logic of ggplot’s main idea, that of building up plots a piece at a time in a systematic and predictable fashion.

The good news is that, from now on, not much will change conceptually about what we are doing. It will be more a question of learning in greater detail about how to tell ggplot what to do. We will learn more about the different geoms available, and find out about the functions that control the coordinate system, scales, guiding elements (like labels and tick marks), and thematic features of plots. Conceptually, however, we will always be doing the same thing. We will start with a table of data that has been tidied, and then we will:

  1. Tell the ggplot() function what our data is. (The data = … step.)
  2. Tell ggplot() what relationships we want to see. (The mapping = aes(…) step.) For convenience we will put the results of the first two steps in an object called p.
  3. Tell ggplot how we want to see the relationships in our data. (Choose a geom.)
  4. Layer on geoms as needed, by adding them to the p object one at a time.
  5. Use some additional functions to adjust scales, labels, tick marks, titles. (The scale_ family, labs(), and guides() functions.) We’ll learn more about some of these functions shortly.
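The five steps above can be sketched in code roughly as follows. This assumes the ggplot2 and gapminder packages are installed. The intermediate names p_points, p_layers, and p_final are hypothetical, used here only to show that each addition yields a new plot object.

```r
library(ggplot2)
library(gapminder)

# Steps 1 and 2: say what the data is and how variables map to aesthetics.
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Step 3: choose a geom.
p_points <- p + geom_point()

# Step 4: layer on another geom.
p_layers <- p_points + geom_smooth()

# Step 5: adjust scales and labels.
p_final <- p_layers +
    scale_x_log10() +
    labs(x = "GDP per capita", y = "Life expectancy in years")

# Nothing is drawn until the object is printed.
print(p_final)
```

In practice the intermediate objects are rarely saved. The pieces are usually chained onto p in a single expression.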

To begin with we will let ggplot use its defaults for many of these elements. The coordinate system used in plots is most often Cartesian, for example. It is a plane defined by an x axis and a y axis. This is what ggplot assumes, unless you tell it otherwise. But we will quickly start making some adjustments. Bear in mind once again that the process of adding layers to the plot really is additive. In effect we create one big object that is a nested list of instructions for how to draw each piece of the plot. Usually in R, functions cannot simply be added to objects. Rather, they take objects as inputs and produce objects as outputs. But the objects created by ggplot() are special. This makes it easier to assemble plots one piece at a time, and to inspect how they look at every step. For example, let’s try a different geom_ function with our plot.

Figure 3.5: Life Expectancy vs GDP, using a smoother.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y=lifeExp))
p + geom_smooth()

You can see right away that some of these geoms do a lot more than simply put points on a grid. Here geom_smooth() has calculated a smoothed line for us and shaded in a ribbon showing the standard error for the line. If we want to see the data points and the line together, we simply add geom_point() back in:

Figure 3.6: Life Expectancy vs GDP, showing both points and a GAM smoother.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y=lifeExp))
p + geom_point() + geom_smooth() 

## `geom_smooth()` using method = 'gam'

The console message tells you that the geom_smooth() function is using a method called gam, which in this case means it has fit a generalized additive model. This suggests that maybe there are other methods that geom_smooth() understands, and which we might tell it to use instead. Instructions are given to functions via their arguments, so we can try adding method = "lm" (for “linear model”) as an argument to geom_smooth():

Figure 3.7: Life Expectancy vs GDP, points and an ill-advised linear fit.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y=lifeExp))
p + geom_point() + geom_smooth(method="lm") 

Notice that we did not have to tell geom_point() or geom_smooth() where their data was coming from, or what mappings they should use. They inherit this information from the original p object. As we’ll see later, it’s possible to give geoms separate instructions that they will follow instead. But in the absence of any other information, the geoms will look for the instructions they need in the ggplot() function, or the object created by it.

In our plot, the data is quite bunched up against the left side. Gross Domestic Product per capita is not normally distributed across our country years. The x-axis scale would probably look better if it were transformed from a linear scale to a log scale. For this we can use a function called scale_x_log10(). As you might expect this function scales the x-axis of a plot to a log 10 basis. To use it we just add it to the plot:

Figure 3.8: Life Expectancy vs GDP scatterplot, with a GAM smoother and a log scale on the x-axis.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y=lifeExp))
p + geom_point() +
    geom_smooth(method="gam") +
    scale_x_log10()

The x-axis transformation repositions the points, and also changes the shape of the smoothed line. (Notice that we switched back to gam from lm.) While ggplot() and its associated functions have not made any changes to our underlying data frame, the scale transformation is applied to the data before the smoother is layered on to the plot. There are a variety of scale transformations that you can use in just this way. Each is named for the transformation you want to apply, and the axis you want to apply it to. In this case we have scale_x_log10(). You can also try scale_x_sqrt() and scale_x_reverse(). There are corresponding functions for y-axis transformations. Experiment with them to see what sort of effect they have on the plot. For example, what happens when you put the geom_smooth() function before geom_point() instead of after it? What does this tell you about how the layers of the plot are drawn?
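As a starting point for these experiments, here is a minimal sketch that sets up two variants to compare against Figure 3.8. It assumes the ggplot2 and gapminder packages are installed. The names p_sqrt and p_reordered are hypothetical.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Variant 1: a square-root scale on the x-axis instead of log10.
p_sqrt <- p + geom_point() +
    geom_smooth(method = "gam") +
    scale_x_sqrt()

# Variant 2: the same layers as Figure 3.8, but with the smoother
# added before the points.
p_reordered <- p + geom_smooth(method = "gam") +
    geom_point() +
    scale_x_log10()

# Print each one in turn and compare the results.
print(p_sqrt)
print(p_reordered)
```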

Figure 3.9: Life Expectancy vs GDP scatterplot, with a GAM smoother and a log scale on the x-axis, with better labels on the tick marks.

At this point, if our goal was just to show a plot of Life Expectancy vs GDP using sensible scales and adding a smoother, we would be thinking about polishing up the plot with nicer axis labels and a title. Perhaps we might also want to replace the scientific notation on the x-axis with the dollar value it actually represents. We can do both of these things quite easily. Let’s take care of the scale first. The labels on the tick-marks can be controlled through the scale_ functions. While it’s possible to roll your own function to label axes (or just supply your labels manually, as we will see later), there’s also a handy scales library that contains some useful pre-made formatting functions. We can either load the whole library with library(scales) or, more conveniently, just grab the specific formatter we want from that library. Here it’s the dollar() function. To reference a function directly from a library we have not loaded we use the syntax thelibrary::thefunction. So, we can do this:

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y=lifeExp))
p + geom_point() +
    geom_smooth(method="gam") +
    scale_x_log10(labels = scales::dollar)

We will learn more about scale transformations later. For now, just remember two things about them. First, you can directly transform your x or y axis by adding something like scale_x_log10() or scale_y_log10() to your plot. When you do so, the x or y axis will be transformed and, by default, the tick marks on the axis will be labeled using scientific notation. Second, you can give these scale_ functions a labels argument that reformats the text printed underneath the tick marks on the axes. Inside the scale_x_log10() function try labels=scales::comma, for example.
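To see what a labels function does on its own, it can help to call the formatter directly on a few numbers before handing it to a scale. The sketch below assumes the scales, ggplot2, and gapminder packages are installed. The vector tick_values and the name p_comma are made up for illustration.

```r
library(ggplot2)
library(gapminder)

# Hypothetical tick positions, just to show what the formatters return.
tick_values <- c(1000, 10000, 100000)

scales::dollar(tick_values)  # character labels such as "$1,000"
scales::comma(tick_values)   # character labels such as "1,000"

# The same formatter, passed by name (without parentheses) to the scale.
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p_comma <- p + geom_point() +
    geom_smooth(method = "gam") +
    scale_x_log10(labels = scales::comma)
print(p_comma)
```

The labels argument receives the function itself. The scale then calls it on whatever break values it has chosen for the axis.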

3.5 Mapping Aesthetics vs Setting Them

An aesthetic mapping specifies that a variable will be expressed by one of the available visual elements, such as size, or color, or shape, and so on. As we’ve seen, we map variables to aesthetics like this:

p <-  ggplot(data = gapminder,
             mapping = aes(x = gdpPercap,
                            y = lifeExp,
                            color = continent))

This code does not give a direct instruction like “color the points purple”. Instead we are saying “color will represent the variable continent”, or “color will be mapped to continent”. If we do want to turn all the points in the figure purple, we do not do it through the mapping function. Look at what happens when we try:

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp,
                          color = "purple"))
p + geom_point() +
    geom_smooth(method='loess') +
    scale_x_log10()

Figure 3.10: What has gone wrong here?


What has happened here? Why is there a legend saying “purple”? And why have the points all turned pinkish-red instead of purple? Remember, an aesthetic is a mapping of variables in your data to properties you can see on the graph. The aes() function is where that mapping is specified, and the function is trying to do its job. It wants to map a variable to the color aesthetic, so it assumes you are giving it a variable. We have only given it one word, though: “purple”. Still, aes() will do its best to treat that word as though it were a variable. A variable should have as many observations as there are rows in the data, so aes() falls back on R’s recycling rules for making vectors of different lengths match up, just as in Chapter 1, when we were able to write ‘my_numbers + 1’ to add one to each element of the vector.

In effect, this creates a new categorical variable for your data. The string “purple” is recycled for every row of your data. Now you have a new column. Every element in it has the same value, “purple”. Then ggplot plots the results on the graph as you’ve asked it to—by mapping it to the color aesthetic. It dutifully makes a legend for this new variable. By default, ggplot displays the points falling into the category “purple”—which is all of them—using its default first-category hue … which is red.
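The recycling behavior can be seen in plain R, outside of ggplot. The sketch below only mimics the effect described above; it is not a description of ggplot’s internals. The values in my_numbers and the small data frame df are made up for illustration.

```r
# A hypothetical vector standing in for my_numbers from Chapter 1.
my_numbers <- c(1, 2, 3, 1, 3, 5, 25)

# The single value 1 is recycled across every element.
my_numbers + 1

# A single string assigned as a column is recycled in the same way,
# giving one copy of "purple" per row.
df <- data.frame(x = c(10, 20, 30))
df$color <- "purple"
df

# The new column has only one distinct value, so treating it as a
# categorical variable yields a single category.
unique(df$color)
```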

The aes() function is for mappings only. Do not use it to change properties to a particular value. If we want to set a property, we do it in the geom_ we are using, outside the aes() function. Try this:

Figure 3.11: Setting the color attribute of the points directly.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(color="purple") +
    geom_smooth(method='loess') +
    scale_x_log10()

The geom_point() function can take a color argument directly, and R knows what color “purple” is. This is not part of the aesthetic mapping that defines the basic structure of the graphic. From the point of view of the grammar or logic of the graph, the fact that the points are colored purple has no significance. The color purple is not representing or mapping a variable or feature of the data in the relevant way.

The various geom_ functions can take many other arguments that will affect how the plot looks, but that do not involve mapping variables to aesthetic elements. Thus, those arguments will never go inside the aes() function. Some of the things we will want to set, like color or size, have the same name as mappable elements. Others, like the method or se arguments in geom_smooth(), affect other aspects of the plot. In this plot’s geom_smooth() call, the smoothing line’s color is set to orange and its size (i.e., thickness) made unreasonably large. (In ggplot2 3.4.0 and later, line thickness is set with linewidth instead, and using size for lines produces a deprecation warning.) We have also turned off the se option, so the standard error ribbon is not shown.

Figure 3.12: Setting some other arguments.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp)) 
p + geom_point(alpha = 0.3) +
    geom_smooth(color = "orange", se = FALSE, size = 8, method = "lm") +
    scale_x_log10()

Meanwhile in the geom_point() call we set the alpha argument to 0.3. Like color, size, and shape, “alpha” is an aesthetic property that points (and some other plot elements) have, and to which variables can be mapped. It controls how transparent the object will appear when drawn. It is measured on a scale of zero to one. An object with an alpha of zero will be completely transparent. Note that this will make any other mappings the object might have, such as color or size, invisible as well. An object with an alpha of one is completely opaque, so any objects or layers directly underneath will not be visible. Choosing an intermediate value of alpha can be useful when there is a lot of data to plot. It makes it easier to see where the bulk of the observations are located, especially when there is a considerable amount of overlap in values. It is also possible to map a continuous variable directly to the alpha property, much like one might map a continuous variable to a single-color gradient. However, this is generally not an effective way of precisely conveying variation in quantity.


Figure 3.13: A more polished plot of Life Expectancy vs GDP.

We are now in a position to make a reasonably polished plot. We can set the alpha of the points to a low value, make nicer x- and y-axis labels, and add a title, subtitle, and caption.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y=lifeExp))
p + geom_point(alpha = 0.3) +
    geom_smooth(method = "gam") +
    scale_x_log10(labels = scales::dollar) +
    labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
         title = "Economic Growth and Life Expectancy",
         subtitle = "Data points are country-years",
         caption = "Source: Gapminder.")

As you can see in the code above, in addition to x, y, and any other aesthetic mappings in your plot (such as size, fill, or color), the labs() function can also set the text for your title, subtitle, and caption. Notice that labs() controls the main labels of the axes. The appearance of things like the axis tick marks is the responsibility of various scale_ functions, such as the scale_x_log10() function used here. We will learn more about what can be done with scale_ functions soon.

Are there any variables in our data that can sensibly be mapped to the color aesthetic? Consider continent. In Figure 3.14 the individual data points have been colored by continent, and a legend with a key to the colors has automatically been added to the plot. In addition, instead of one smoothing line we now have five, one for each unique value of the continent variable. This is a consequence of the way aesthetic mappings are inherited. Along with x and y, the color aesthetic mapping is set in the call to ggplot() that we used to create the p object. Unless told otherwise, all geoms layered on top of the original plot object will inherit that object’s mappings. In this case we get both our points and smoothers colored by continent.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp,
                          color = continent))
p + geom_point() +
    geom_smooth(method='loess') +
    scale_x_log10()

Figure 3.14: Mapping the continent variable to the color aesthetic.

If it is what we want, then we might also consider shading the standard error ribbon of each line to match its dominant color. The color of the standard error ribbon is controlled by the fill aesthetic. Whereas the color aesthetic affects the appearance of lines and points, fill is for the filled areas of bars, polygons, and—in this case—the interior of the smoother’s standard error ribbon.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp,
                          color = continent,
                          fill = continent))
p + geom_point() +
    geom_smooth(method='loess') +
    scale_x_log10()

Figure 3.15: Mapping the continent variable to the color aesthetic, and correcting the error bars using the fill aesthetic.

Making sure that color and fill aesthetics match up consistently in this way improves the overall look of the plot. As you can see, in order to make it happen we just need to specify that the mappings are to the same variable in each case.

3.6 Aesthetics can be Mapped per Geom

Perhaps five separate smoothers is too many, and we just want one line. But we still would like to have the points color-coded by continent. By default, geoms inherit their mappings from the ggplot() function. We can change this by mapping the aesthetics we want only in the geom_ functions that we want them to apply to. We use the same mapping = aes(...) expression as in the initial call to ggplot(), but now use it in the geom_ functions as well, specifying only the mappings we want to apply to each one. Mappings specified only in the initial ggplot() function (in this case, x and y) will carry through to all subsequent geoms.

Figure 3.16: Mapping aesthetics on a per-geom basis. Here color is mapped to continent for the points but not the smoother.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(mapping = aes(color = continent)) +
    geom_smooth(method='loess') +
    scale_x_log10()    

It’s possible to map continuous variables to the color aesthetic, too. For example, we can map each country-year’s population (pop) to color. When we do this, ggplot produces a gradient scale. It is continuous, but marked at intervals in the legend. Generally, mapping quantities like population to a continuous color gradient is less effective than cutting the variable into categorical bins running, e.g., from low to high.

Figure 3.17: Mapping a continuous variable to color.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(mapping = aes(color = pop)) +
    scale_x_log10()    
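
For comparison, here is one possible sketch of the binned alternative mentioned above. It assumes the ggplot2 and gapminder packages are installed, and uses ggplot2’s cut_number() helper to split pop into four groups of roughly equal size. The choice of four bins and the names gap_binned, pop_bin, and p_binned are arbitrary and only for illustration.

```r
library(ggplot2)
library(gapminder)

# Work on a copy so the original gapminder table is left untouched.
gap_binned <- gapminder

# cut_number() divides a numeric vector into n groups with
# approximately equal numbers of observations.
gap_binned$pop_bin <- cut_number(gap_binned$pop, n = 4)

p <- ggplot(data = gap_binned,
            mapping = aes(x = gdpPercap, y = lifeExp))

# pop_bin is a factor, so color is now mapped to discrete categories
# rather than to a continuous gradient.
p_binned <- p + geom_point(mapping = aes(color = pop_bin)) +
    scale_x_log10()
print(p_binned)
```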

The main flow of action in ggplot is always the same. You start with a table of data, you map the variables you want to display to aesthetics like position, color, or shape, and you choose one or more geoms to draw the graph. In your code this gets accomplished by making an object with the basic information about data and mappings, and then adding or layering additional information as needed. Once you get used to this way of constructing your plots—especially the aesthetic mapping part—drawing plots becomes much easier. Instead of having to think about how to draw particular shapes or colors on the screen, the many geom_ functions take care of that for you. In the same way, learning new geoms is easier once you think of them as ways to display aesthetic mappings that you specify. Most of the learning curve with ggplot involves getting used to this way of thinking about your data and its representation in a plot. In the next chapter, we will flesh out these ideas a little more, cover some common ways plots go “wrong” (i.e., when they end up looking strange), and learn how to recognize and avoid those problems.