3 Make a plot

This Chapter will teach you how to use ggplot’s core functions to produce a series of scatterplots. From one point of view, we will proceed slowly and carefully, taking our time to understand the logic behind the commands that you type. The reason for this is that the central activity of visualizing data with ggplot more or less always involves the same sequence of steps. So it is worth learning what they are.

From another point of view, though, we will move fast. Once you have the basic sequence down, and understand how it is that ggplot assembles the pieces of a plot into a final image, then you will find that analytically and aesthetically sophisticated plots come within your reach very quickly. By the end of this Chapter, for example, we will have learned how to produce a small-multiple plot of time series data for a large number of countries, with a smoothed regression line in each panel. So in another sense we will have moved quite fast.

3.1 How ggplot works

As we saw in Chapter 1, visualization involves representing your data using lines or shapes or colors and so on. There is some structured relationship, some mapping, between the variables in your data and their representation in the plot displayed on your screen or on the page. We also saw that not all mappings make sense for all types of variables, and (independently of this), some representations are harder to interpret than others. Ggplot provides you with a set of tools to map data to visual elements on your plot, to specify the kind of plot you want, and then subsequently to control the fine details of how it will be displayed. Figure 3.1 shows a schematic outline of the process starting from data, at the top, down to a finished plot at the bottom. Don’t worry about the details for now. We will be going into them one piece at a time over the next few chapters.

Figure 3.1: The main elements of ggplot’s grammar of graphics. This chapter goes through these steps in detail.

The most important thing to get used to with ggplot is the way you use it to think about the logical structure of your plot. The code you write specifies the connections between the variables in your data, and the colors, points, and shapes you see on the screen. In ggplot, these logical connections between your data and the plot elements are called aesthetic mappings or just aesthetics. You begin every plot by telling the ggplot() function what your data is, and then how the variables in this data logically map onto the plot’s aesthetics. Then you take the result and say what general sort of plot you want, such as a scatterplot, a boxplot, or a bar chart. In ggplot, the overall type of plot is called a geom. Each geom has a function that creates it. For example, geom_point() makes scatterplots, geom_bar() makes barplots, geom_boxplot() makes boxplots, and so on. You combine these two pieces, the ggplot() object and the geom, by literally adding them together in an expression, using the “+” symbol.

At this point, ggplot will have enough information to be able to draw a plot for you. The rest is just details about exactly what you want to see. If you don’t specify anything further, ggplot will use a set of defaults that try to be sensible about what gets drawn. But more often, you will want to specify exactly what you want, including information about the scales, the labels of legends and axes, and other guides that help people to read the plot. These additional pieces are added to the plot in the same way as the geom_ function was. Each component has its own function, you provide arguments to it specifying what to do, and you literally add it to the sequence of instructions. In this way you systematically build your plot piece by piece.

In this chapter we will go through the main steps of this process. We will proceed by example, repeatedly building a series of plots. As noted earlier, I strongly encourage you to go through this exercise manually, typing (rather than copying-and-pasting) the code yourself. This may seem a bit tedious, but it is by far the most effective way to get used to what is happening, and to get a feel for R’s syntax. While you’ll inevitably make some errors, you will also quickly find yourself becoming able to diagnose your own errors, as well as having a better grasp of the higher-level structure of plots. You should open the RMarkdown file for your notes, remember to load the tidyverse library with library(tidyverse), and write the code out in chunks, interspersing your own notes and comments as you go.

Table 3.1: Life Expectancy data in wide format.

country 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Afghanistan 29 30 32 34 36 38 40 41 42 42 42 44
Albania 55 59 65 66 68 69 70 72 72 73 76 76
Algeria 43 46 48 51 55 58 61 66 68 69 71 72
Angola 30 32 34 36 38 39 40 40 41 41 41 43
Argentina 62 64 65 66 67 68 70 71 72 73 74 75
Australia 69 70 71 71 72 73 75 76 78 79 80 81

3.2 Tidy data

The tidyverse tools we will be using want to see your data in a particular sort of shape, generally referred to as “tidy data” (Wickham, 2014). Social scientists will likely be familiar with the distinction between wide format and long format data. In a long format table, every variable is a column, and every observation is a row. In a wide format table, some variables are spread out across columns. For example, Table 3.1 shows part of a table of life expectancy over time for a series of countries. This is in wide format, because one of the variables, year, is spread across the columns of the table.

By contrast, Table 3.2 shows the beginning of the same data in long format. The tidy data that ggplot wants is in this long form. In a related bit of terminology, in this table the year variable is sometimes called a key and the lifeExp variable is the value taken by that key for any particular row. These terms are useful when converting tables from wide to long format. I am speaking fairly loosely here. Underneath these terms there is a worked-out theory of the forms that tabular data can be stored in, but right now we don’t need to know those additional details. For more on the ideas behind tidy data, see the discussion in the Appendix. You will also find an example showing the R code you need to get from an untidy to a tidy shape for the common “wide” case where some variables are spread out across the columns of your table.
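As a rough illustration of the wide-to-long conversion described above, here is a minimal sketch. It assumes the tidyr package is installed and uses its pivot_longer() function, which may not be the function used in the Appendix. The small wide table is typed in by hand from the first rows of Table 3.1, and the object names wide and long are just for illustration.

```r
library(tidyr)

# A few cells of Table 3.1, entered by hand as a wide table.
# check.names = FALSE keeps the year column names as "1952", "1957", ...
wide <- data.frame(
    country = c("Afghanistan", "Albania"),
    `1952` = c(29, 55),
    `1957` = c(30, 59),
    `1962` = c(32, 65),
    check.names = FALSE
)

# Gather every column except country into key-value pairs:
# the old column names become the year key, the cells become lifeExp.
long <- pivot_longer(wide,
                     cols = -country,
                     names_to = "year",
                     values_to = "lifeExp")

# The keys arrive as text, so convert them to whole numbers.
long$year <- as.integer(long$year)
long
```

Each row of the result is now one country-year, which is the shape shown in Table 3.2.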

Table 3.2: Life Expectancy data in long format.

country year lifeExp
Afghanistan 1952 29
Afghanistan 1957 30
Afghanistan 1962 32
Afghanistan 1967 34
Afghanistan 1972 36
Afghanistan 1977 38

If you compare Tables 3.1 and 3.2, it is clear that a tidy table does not present data in its most compact form. In fact, it is usually not how you would choose to present your data if you wanted to just show people the numbers. Neither is untidy data “messy” or the “wrong” way to lay out data in some generic sense. It’s just that, even if its long-form shape makes tables larger, tidy data is much more straightforward to work with when it comes to specifying the mappings that you need to coherently describe plots.

3.4 Build your plots layer by layer

Although we got a brief taste of ggplot at the end of Chapter 2, we spent more time in that Chapter preparing the ground to make this first proper graph. We set up our software IDE and made sure we could reproduce our work. We then learned the basics of how R works, and the sort of tidy data that ggplot expects. Just now we went through the logic of ggplot’s main idea, of building up plots a piece at a time in a systematic and predictable fashion, beginning with a mapping between a variable and an aesthetic element. We have done a lot of work and produced one plot.

The good news is that, from now on, not much will change conceptually about what we are doing. It will be more a question of learning in greater detail about how to tell ggplot what to do. We will learn more about the different geoms (or types of plot) available, and find out about the functions that control the coordinate system, scales, guiding elements (like labels and tick marks), and thematic features of plots. This will allow us to make much more sophisticated plots surprisingly fast. Conceptually, however, we will always be doing the same thing. We will start with a table of data that has been tidied, and then we will:

  1. Tell the ggplot() function what our data is. This is the data = … step.
  2. Tell ggplot() what relationships we want to see. This is the mapping = aes(…) step. For convenience we will put the results of the first two steps in an object called p.
  3. Tell ggplot how we want to see the relationships in our data. That is, choose a geom.
  4. Layer on geoms as needed, by adding them to the p object one at a time.
  5. Use some additional functions to adjust scales, labels, tick marks, titles. These are the scale_ family, labs(), and guides() functions. We’ll learn more about some of these functions shortly.
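The five steps can be sketched in code as follows. This is only an outline under some assumptions: it presumes the ggplot2 and gapminder packages are installed, the object names p_points and p_final are made up for the example, and counting layers with p$layers looks inside the plot object in a way that could differ between ggplot2 versions.

```r
library(ggplot2)
library(gapminder)

# Steps 1 and 2: the data and the mappings, stored in p.
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Steps 3 and 4: choose a geom and add it to p as a layer.
p_points <- p + geom_point()

# Step 5: adjust the scales and labels.
p_final <- p_points +
    scale_x_log10() +
    labs(x = "GDP Per Capita", y = "Life Expectancy in Years")

# p on its own has no layers; the finished plot has one, from geom_point().
length(p$layers)
length(p_final$layers)
```

Nothing is drawn until a plot object is printed, so each intermediate object can be inspected or extended before you look at the result.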

To begin with we will let ggplot use its defaults for many of these elements. The coordinate system used in plots is most often cartesian, for example. It is a plane defined by an x axis and a y axis. This is what ggplot assumes, unless you tell it otherwise. But we will quickly start making some adjustments. Bear in mind once again that the process of adding layers to the plot really is additive. In effect we create one big object that is a nested list of instructions for how to draw each piece of the plot. Usually in R, functions cannot simply be added to objects. Rather, they take objects as inputs and produce objects as outputs. But the objects created by ggplot() are special. This makes it easier to assemble plots one piece at a time, and to inspect how they look at every step. For example, let’s try a different geom_ function with our plot.

Figure 3.5: Life Expectancy vs GDP, using a smoother.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y=lifeExp))
p + geom_smooth()

You can see right away that some of these geoms do a lot more than simply put points on a grid. Here geom_smooth() has calculated a smoothed line for us and shaded in a ribbon showing the standard error for the line. If we want to see the data points and the line together, we simply add geom_point() back in:

Figure 3.6: Life Expectancy vs GDP, showing both points and a GAM smoother.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y=lifeExp))
p + geom_point() + geom_smooth() 

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The console message tells you that the geom_smooth() function is using a method called gam, which in this case means it has fit a generalized additive model. This suggests that maybe there are other methods that geom_smooth() understands, and which we might tell it to use instead. Instructions are given to functions via their arguments, so we can try adding method = "lm" (for “linear model”) as an argument to geom_smooth():

Figure 3.7: Life Expectancy vs GDP, points and an ill-advised linear fit.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y=lifeExp))
p + geom_point() + geom_smooth(method = "lm") 

We did not have to tell geom_point() or geom_smooth() where their data was coming from, or what mappings they should use. They inherit this information from the original p object. As we’ll see later, it’s possible to give geoms separate instructions that they will follow instead. But in the absence of any other information, the geoms will look for the instructions they need in the ggplot() function, or the object created by it.

In our plot, the data is quite bunched up against the left side. Gross Domestic Product per capita is not normally distributed across our country-years. The x-axis scale would probably look better if it were transformed from a linear scale to a log scale. For this we can use a function called scale_x_log10(). As you might expect, this function scales the x-axis of a plot to a log 10 basis. To use it we just add it to the plot:

Figure 3.8: Life Expectancy vs GDP scatterplot, with a GAM smoother and a log scale on the x-axis.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y=lifeExp))
p + geom_point() +
    geom_smooth(method = "gam") +
    scale_x_log10()

The x-axis transformation repositions the points, and also changes the shape of the smoothed line. (We switched back to gam from lm.) While ggplot() and its associated functions have not made any changes to our underlying data frame, the scale transformation is applied to the data before the smoother is layered on to the plot. There are a variety of scale transformations that you can use in just this way. Each is named for the transformation you want to apply, and the axis you want to apply it to. In this case we use scale_x_log10().
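One way to see that the transformation is applied before the smoother is fitted, and that the data frame itself is left alone, is to look at the values ggplot computes for the smoother layer. The sketch below assumes the ggplot2 and gapminder packages are installed. It uses ggplot2’s layer_data() helper, and it fits method = "lm" with an explicit formula purely to keep the example small and quiet. The names p_log and smooth_df are made up for the illustration.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p_log <- p + geom_point() +
    geom_smooth(method = "lm", formula = y ~ x) +
    scale_x_log10()

# layer_data() returns the data ggplot computed for one layer.
# Layer 1 is the points; layer 2 is the smoother.
smooth_df <- layer_data(p_log, 2)

# The smoother's x values are in log10 units ...
range(smooth_df$x)
range(log10(gapminder$gdpPercap))

# ... while the original column is still in dollars.
range(gapminder$gdpPercap)
```

The first two ranges should agree, which is what we would expect if the line was fitted to the logged values rather than to the raw ones.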

At this point, if our goal was just to show a plot of Life Expectancy vs GDP using sensible scales and adding a smoother, we would be thinking about polishing up the plot with nicer axis labels and a title. Perhaps we might also want to replace the scientific notation on the x-axis with the dollar value it actually represents. We can do both of these things quite easily. Let’s take care of the scale first. The labels on the tick-marks can be controlled through the scale_ functions. While it’s possible to roll your own function to label axes (or just supply your labels manually, as we will see later), there’s also a handy scales library that contains some useful pre-made formatting functions. We can either load the whole library with library(scales) or, more conveniently, just grab the specific formatter we want from that library. Here it’s the dollar() function. To grab a function directly from a library we have not loaded, we use the syntax thelibrary::thefunction. So, we can do this:

Figure 3.9: Life Expectancy vs GDP scatterplot, with a GAM smoother and a log scale on the x-axis, with better labels on the tick marks.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y=lifeExp))
p + geom_point() +
    geom_smooth(method = "gam") +
    scale_x_log10(labels = scales::dollar)

We will learn more about scale transformations later. For now, just remember two things about them. First, you can directly transform your x or y axis by adding something like scale_x_log10() or scale_y_log10() to your plot. When you do so, the x or y axis will be transformed and, by default, the tick marks on the axis will be labeled using scientific notation. Second, you can give these scale_ functions a labels argument that reformats the text printed underneath the tick marks on the axes. Inside the scale_x_log10() function try labels=scales::comma, for example.
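Because the labels argument accepts a function, it can help to try the formatters on a few numbers before using them in a plot. The following sketch assumes the ggplot2, gapminder, and scales packages are installed. The label_thousands() function is a hypothetical hand-rolled labeller written for this example; it is not part of any package.

```r
library(ggplot2)
library(gapminder)

# The ready-made formatters are ordinary functions that
# turn numbers into text.
scales::dollar(c(1000, 10000))
scales::comma(c(1000, 10000))

# A hypothetical home-made labeller: show dollar amounts in thousands.
label_thousands <- function(x) {
    paste0("$", x / 1000, "k")
}
label_thousands(c(1000, 10000))

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Any function that takes the break values and returns text
# can be passed to labels.
p + geom_point() +
    scale_x_log10(labels = label_thousands)
```

Note that the function is passed by name, without parentheses, just as with scales::dollar above.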

3.5 Mapping aesthetics vs setting them

An aesthetic mapping specifies that a variable will be expressed by one of the available visual elements, such as size, or color, or shape, and so on. As we’ve seen, we map variables to aesthetics like this:

p <-  ggplot(data = gapminder,
             mapping = aes(x = gdpPercap,
                            y = lifeExp,
                            color = continent))

This code does not give a direct instruction like “color the points purple”. Instead it says, “the property ‘color’ will represent the variable continent”, or “color will map continent”. If we want to turn all the points in the figure purple, we do not do it through the mapping function. Look at what happens when we try:

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp,
                          color = "purple"))
p + geom_point() +
    geom_smooth(method = "loess") +
    scale_x_log10()

Figure 3.10: What has gone wrong here?

What has happened here? Why is there a legend saying “purple”? And why have the points all turned pinkish-red instead of purple? Remember, an aesthetic is a mapping of variables in your data to properties you can see on the graph. The aes() function is where that mapping is specified, and the function is trying to do its job. It wants to map a variable to the color aesthetic, so it assumes you are giving it a variable. We have only given it one word, though: “purple”. Still, aes() will do its best to treat that word as though it were a variable. A variable should have as many observations as there are rows in the data, so aes() falls back on R’s recycling rules for making vectors of different lengths match up. (This is just as in Chapter 1, when we were able to write my_numbers + 1 to add one to each element of the vector.)

In effect, this creates a new categorical variable for your data. The string “purple” is recycled for every row of your data. Now you have a new column. Every element in it has the same value, “purple”. Then ggplot plots the results on the graph as you’ve asked it to, by mapping it to the color aesthetic. It dutifully makes a legend for this new variable. By default, ggplot displays the points falling into the category “purple” (which is all of them) using its default first-category hue … which is red.

The aes() function is for mappings only. Do not use it to change properties to a particular value. If we want to set a property, we do it in the geom_ we are using, and outside the mapping = aes(...) step. Try this:

Figure 3.11: Setting the color attribute of the points directly.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(color = "purple") +
    geom_smooth(method = "loess") +
    scale_x_log10()

The geom_point() function can take a color argument directly, and R knows what color “purple” is. This is not part of the aesthetic mapping that defines the basic structure of the graphic. From the point of view of the grammar or logic of the graph, the fact that the points are colored purple has no significance. The color purple is not representing or mapping a variable or feature of the data in the relevant way.
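The difference between mapping and setting can also be checked without looking at the picture. This sketch assumes the ggplot2 and gapminder packages are installed. It uses layer_data() to pull out the colour that ggplot finally assigns to each point; the object names p_mapped and p_set are made up for the example.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Mapping: "purple" is treated as a one-category variable.
p_mapped <- p + geom_point(mapping = aes(color = "purple"))

# Setting: "purple" is passed to the geom as a fixed property.
p_set <- p + geom_point(color = "purple")

# The colour actually used for the points in each case.
unique(layer_data(p_mapped)$colour)
unique(layer_data(p_set)$colour)
```

In the mapped version every point gets the same colour from ggplot’s default palette, not purple. In the set version the colour is the one we asked for.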

Figure 3.12: Setting some other arguments.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp)) 
p + geom_point(alpha = 0.3) +
    geom_smooth(color = "orange", se = FALSE, size = 8, method = "lm") +
    scale_x_log10()

The various geom_ functions can take many other arguments that will affect how the plot looks, but that do not involve mapping variables to aesthetic elements. Thus, those arguments will never go inside the aes() function. Some of the things we will want to set, like color or size, have the same name as mappable elements. Others, like the method or se arguments in geom_smooth() affect other aspects of the plot. In the code for Figure 3.12, the geom_smooth() call sets the line color to orange and sets its size (i.e., thickness) to 8, an unreasonably large value. We also turn off the se option by switching it from its default value of TRUE to FALSE. The result is that the standard error ribbon is not shown.

Meanwhile, in the geom_point() call we set the alpha argument to 0.3. Like color, size, and shape, “alpha” is an aesthetic property that points (and some other plot elements) have, and to which variables can be mapped. It controls how transparent the object will appear when drawn. It’s measured on a scale of zero to one. An object with an alpha of zero will be completely transparent. Setting it to zero will make any other mappings the object might have, such as color or size, invisible as well. An object with an alpha of one will be completely opaque. Choosing an intermediate value can be useful when there is a lot of overlapping data to plot, as it makes it easier to see where the bulk of the observations are located. (It’s also possible to map a continuous variable directly to the alpha property, much like one might map a continuous variable to a single-color gradient. However, this is generally not an effective way of precisely conveying variation in quantity.)

Figure 3.13: A more polished plot of Life Expectancy vs GDP.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y=lifeExp))
p + geom_point(alpha = 0.3) +
    geom_smooth(method = "gam") +
    scale_x_log10(labels = scales::dollar) +
    labs(x = "GDP Per Capita", y = "Life Expectancy in Years",
         title = "Economic Growth and Life Expectancy",
         subtitle = "Data points are country-years",
         caption = "Source: Gapminder.")

We can now make a reasonably polished plot. We set the alpha of the points to a low value, make nicer x- and y-axis labels, and add a title, subtitle, and caption. As you can see in the code above, in addition to x, y, and any other aesthetic mappings in your plot (such as size, fill, or color), the labs() function can also set the text for title, subtitle, and caption. It controls the main labels of scales. The appearance of things like axis tick marks is the responsibility of various scale_ functions, such as the scale_x_log10() function used here. We will learn more about what can be done with scale_ functions soon.

Are there any variables in our data that can sensibly be mapped to the color aesthetic? Consider continent. In Figure 3.14 the individual data points have been colored by continent, and a legend with a key to the colors has automatically been added to the plot. In addition, instead of one smoothing line we now have five. There is one for each unique value of the continent variable. This is a consequence of the way aesthetic mappings are inherited. Along with x and y, the color aesthetic mapping is set in the call to ggplot() that we used to create the p object. Unless told otherwise, all geoms layered on top of the original plot object will inherit that object’s mappings. In this case we get both our points and smoothers colored by continent.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp,
                          color = continent))
p + geom_point() +
    geom_smooth(method = "loess") +
    scale_x_log10()

Figure 3.14: Mapping the continent variable to the color aesthetic.

If it is what we want, then we might also consider shading the standard error ribbon of each line to match its dominant color. The color of the standard error ribbon is controlled by the fill aesthetic. Whereas the color aesthetic affects the appearance of lines and points, fill is for the filled areas of bars, polygons and, in this case, the interior of the smoother’s standard error ribbon.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp,
                          color = continent,
                          fill = continent))
p + geom_point() +
    geom_smooth(method = "loess") +
    scale_x_log10()

Figure 3.15: Mapping the continent variable to the color aesthetic, and correcting the error bars using the fill aesthetic.

Making sure that color and fill aesthetics match up consistently in this way improves the overall look of the plot. In order to make it happen we just need to specify that the mappings are to the same variable in each case.

3.6 Aesthetics can be mapped per geom

Perhaps five separate smoothers is too many, and we just want one line. But we still would like to have the points color-coded by continent. By default, geoms inherit their mappings from the ggplot() function. We can change this by mapping the aesthetics we want only in the geom_ functions that we want them to apply to. We use the same mapping = aes(...) expression as in the initial call to ggplot(), but now use it in the geom_ functions as well, specifying only the mappings we want to apply to each one. Mappings specified only in the initial ggplot() function (here, x and y) will carry through to all subsequent geoms.

Figure 3.16: Mapping aesthetics on a per-geom basis. Here color is mapped to continent for the points but not the smoother.

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point(mapping = aes(color = continent)) +
    geom_smooth(method = "loess") +
    scale_x_log10()

It’s possible to map continuous variables to the color aesthetic, too. For example, we can map the log of each country-year’s population (pop) to color. (We can take the log of population right in the aes() statement, using the log() function. R will evaluate this for us quite happily.) When we do this, ggplot produces a gradient scale. It is continuous, but marked at intervals in the legend. Depending on the circumstances, mapping quantities like population to a continuous color gradient may be more or less effective than cutting the variable into categorical bins running, e.g., from low to high. In general it is always worth looking at the data in its continuous form first rather than cutting or binning it into categories.

Figure 3.17: Mapping a continuous variable to color.

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap,
                          y = lifeExp))
p + geom_point(mapping = aes(color = log(pop))) +
    scale_x_log10()    

Finally, it is worth paying a little more attention to the way that ggplot draws its scales. Because every mapped variable has a scale, we can learn a lot about how a plot has been constructed, and what mappings it contains, by seeing what the legends look like. For example, take a closer look at the legends produced in Figures 3.15 and 3.16.

Figure 3.18: Guides and legends faithfully reflect the mappings they represent.

In the legend for the first figure, shown here on the left, we see several visual elements. The key for each continent shows a dot, a line, and a shaded background. The key for the second figure, shown on the right, has only a dot for each continent, with no shaded background or line. If you look again at the code for Figures 3.15 and 3.16, you will see that in the first case we mapped the continent variable to both color and fill. We then drew the figure with geom_point() and fitted a line for each continent with geom_smooth(). Points have color but the smoother understands both color (for the line itself) and fill (for the shaded standard error ribbon). Each of these elements is represented in the legend: the point color, the line color, and the ribbon fill. In the second figure, we decided to simplify things by having only the points be colored by continent. Then we drew just a single smoother for the whole graph. Thus, in the legend for that figure, the colored line and the shaded box are both absent. We only see a legend for the mapping of color to continent in geom_point(). Meanwhile on the graph itself the line drawn by geom_smooth() is set by default to a bright blue, different from anything on the scale, and its shaded error ribbon is set by default to gray. Small details like this are not accidents. They are a direct consequence of ggplot’s grammatical way of thinking about the relationship between the data behind the plot and the visual elements that represent it.
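The same bookkeeping that produces the legends can be read off the plot object. The sketch below rebuilds the plot from Figure 3.16 and lists which aesthetics are mapped at the plot level and in each layer. It assumes the ggplot2 and gapminder packages are installed. The name p_fig is made up for the example, and reaching into the object with $mapping and $layers relies on details of how ggplot2 stores plots, which may change between versions.

```r
library(ggplot2)
library(gapminder)

p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))
p_fig <- p + geom_point(mapping = aes(color = continent)) +
    geom_smooth(method = "loess") +
    scale_x_log10()

# Plot-level mappings, inherited by every geom.
names(p_fig$mapping)

# Mappings that belong to the geom_point() layer only.
# ggplot2 stores the aesthetic under its British spelling, "colour".
names(p_fig$layers[[1]]$mapping)

# The geom_smooth() layer adds no mappings of its own.
names(p_fig$layers[[2]]$mapping)
```

Only the point layer carries a colour mapping, which is why the legend for that figure shows dots and nothing else.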

3.7 Save your work

Now that you have started to make your own plots, you may be wondering how to save them, and perhaps also how to control their size and format. If you are working in an RMarkdown document then the plots you make will be embedded in it, as we have already seen. You can set the default size of plots within your .Rmd document by setting an option in your first code chunk. This one tells R to make 8x5 figures:

knitr::opts_chunk$set(fig.width=8, fig.height=5) 

Because you will be making plots of different sizes and shapes, sometimes you will want to control the size of particular plots, without changing the default. To do this, you can add the same options to any particular chunk inside the curly braces at the beginning. Remember, each chunk opens with three backticks and then a pair of braces containing the language name (for us always r) and an optional label:

```{r example}
p + geom_point()
```

You can follow the label with a comma and provide a series of options if needed. They will apply only to that chunk. To make a figure twelve inches wide and nine inches high we say e.g. {r example, fig.width = 12, fig.height = 9} in the braces section.

You will often need to save your figures individually, as they will end up being dropped into slides or published in papers that are not produced using RMarkdown. Saving a figure to a file can be done in several different ways. When working with ggplot, the easiest way is to use the ggsave() function. To save the most recently displayed figure, we provide the name we want to save it under:

ggsave(filename = "my_figure.png")

This will save the figure as a PNG file, a format suitable for displaying on web pages. Several other file formats are available as well. See the function’s help page for details. If you want a PDF instead, change the extension of the file:

ggsave(filename = "my_figure.pdf")

Remember that, for convenience, you do not need to write filename = as long as the name of the file is the first argument you give ggsave(). You can also pass plot objects to ggsave(). For example, we can put our recent plot into an object called p_out and then tell ggsave() that we want to save that object.

p_out <- p + geom_point() +
    geom_smooth(method = "loess") +
    scale_x_log10()

ggsave("my_figure.pdf", plot = p_out)

When saving your work, it is sensible to have a subfolder (or more than one, depending on the project) where you save only figures. You should also take care to name your saved figures in a sensible way. fig_1.pdf or my_figure.pdf are not good names. Figure names should be compact but descriptive, and consistent between figures within a project. In addition, although it really shouldn’t be the case in this day and age, it is also wise to play it safe and avoid file names containing characters likely to make your code choke in future. These include apostrophes, backticks, spaces, forward and back slashes, and quotes.

The Appendix contains a short discussion of how to organize your files within your project folder. Treat the project folder as the home base of your work for the paper or work you are doing, and put your data and figures in subfolders within the project folder. To begin with, using your file manager, create a folder named “figures” inside your project folder. When saving figures, you can use Kirill Müller’s handy here library to make it easier to work with files and subfolders while not having to type in full file paths. Load the library in the setup chunk of your RMarkdown document. When you do, it tells you where “here” is for the current project. You will see a message saying something like this, with your file path and user name instead of mine:

# here() starts at /Users/kjhealy/projects/socviz

You can then use the here() function to make loading and saving your work more straightforward and safer. Assuming a folder named “figures” exists in your project folder, you can do this:

ggsave(here("figures", "lifexp_vs_gdp_gradient.pdf"), plot = p_out)

This saves p_out as a file called lifexp_vs_gdp_gradient.pdf in the figures directory under here, i.e. inside your current project folder.
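
The call above assumes that the figures folder already exists. If you would rather create it from R than from your file manager, a minimal sketch looks like the following. The object name fig_dir is just an illustration, and the code assumes the here library is installed.

```r
library(here)

# Build the path to a "figures" folder inside the project folder
fig_dir <- here("figures")

# Create the folder only if it is not already there
if (!dir.exists(fig_dir)) {
    dir.create(fig_dir)
}
```

Because the folder is only created when it is missing, it is safe to leave this in your setup chunk and re-run it.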

You can save your figure in a variety of different formats, depending on your needs (and also, to a lesser extent, on your particular computer system). The most important distinction to bear in mind is between vector formats and raster formats. A file with a vector format, like PDF or SVG, is stored as a set of instructions about lines, shapes, colors, and their relationships. The viewing software (such as Adobe Acrobat or Apple’s Preview application for PDFs) then interprets those instructions and displays the figure. Representing the figure this way allows it to be easily resized without becoming distorted. The PDF format grew out of PostScript, which is also the language of modern typesetting and printing. This makes a vector-based format like PDF the best choice for submission to journals.

A raster-based format, on the other hand, stores images essentially as a grid of pixels of a pre-defined size with information about the location, color, brightness, and so on of each pixel in the grid. This makes for more efficient storage, especially when used in conjunction with compression methods that take advantage of redundancy in images in order to save space. Formats like JPG are compressed raster formats. A PNG file is a raster image format that supports lossless compression. For graphs containing an awful lot of data, PNG files will tend to be much smaller than the corresponding PDF. However, raster formats cannot be easily resized. In particular they cannot be expanded in size without becoming pixelated or grainy. Formats like JPG and PNG are the standard way that images are displayed on the web. The more recent SVG format is vector-based but is nevertheless supported by many web browsers.

In general you should save your work in several different formats. When you save in different formats and in different sizes you may need to experiment with the scaling of the plot and the size of the fonts in order to get a good result. The scale argument to ggsave() can help you here (you can try out different values, like scale = 1.3, scale = 5, and so on). You can also use ggsave() to explicitly set the height and width of your plot in the units that you choose.

ggsave(here("figures", "lifexp_vs_gdp_gradient.pdf"),
       plot = p_out, height = 8, width = 10, units = "in")
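
If you want both a vector and a raster version of the same figure, one way to do it is to loop over file extensions, since ggsave() chooses the output format from the extension of the file name. The following is only a sketch. It rebuilds a plot like p_out so that it runs on its own (assuming p holds the basic gapminder mapping of gdpPercap and lifeExp), and the dpi value of 300 is an illustrative choice that matters only for the raster file.

```r
library(ggplot2)
library(gapminder)
library(here)

# Rebuild a plot like p_out so this block is self-contained
p_out <- ggplot(data = gapminder,
                mapping = aes(x = gdpPercap, y = lifeExp)) +
    geom_point() +
    geom_smooth(method = "loess") +
    scale_x_log10()

# Make sure the figures folder exists before saving into it
dir.create(here("figures"), showWarnings = FALSE)

# One vector format (pdf) and one raster format (png)
for (ext in c("pdf", "png")) {
    out_file <- here("figures", paste0("lifexp_vs_gdp_gradient.", ext))
    ggsave(out_file, plot = p_out,
           height = 8, width = 10, units = "in", dpi = 300)
}
```

Keeping the height, width, and units the same across formats means the two files should look alike apart from the resolution of the raster version.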

Now that you know how to do that, let’s get back to making more graphs.

3.8 Where to go next

Start by playing around with the gapminder data a little more. You can try each of these explorations with geom_point() and then with geom_smooth(), or both together.

  • What happens when you put the geom_smooth() function before geom_point() instead of after it? What does this tell you about how the plot is drawn? Think about how this might be useful when drawing plots.
  • Change the mappings in the aes() function so that you plot Life Expectancy against population (pop) rather than per capita GDP. What does that look like? What does it tell you about the unit of observation in the dataset?
  • Try some alternative scale mappings. Besides scale_x_log10() you can try scale_x_sqrt() and scale_x_reverse(). There are corresponding functions for y-axis transformations. Just write y instead of x. Experiment with them to see what sort of effect they have on the plot, and whether they make any sense to use.
  • What happens if you map color to year instead of continent? Is the result what you expected? Think about what class of object year is. Remember you can get a quick look at the top of the data, which includes some shorthand information on the class of each variable, by typing gapminder.
  • Instead of mapping color = year, what happens if you try color = factor(year)?
  • As you look at these different scatterplots, think about Figure 3.13 a little more critically. We worked it up to the point where it was reasonably polished, but is it really the best way to display this country-year data? What are we gaining and losing by ignoring the temporal and country-level structure of the data? How could we do better? Sketch out what an alternative visualization might look like.

As you begin to experiment, remember two things. First, it’s always worth trying something, even if you’re not sure what’s going to happen. Don’t be afraid of the console. This license to experiment does not extend to, for example, overwriting or deleting your data by mistake. You should still manage your project responsibly, at a minimum keeping good backups. But within R and at the level of experimenting with graphs at the console, you have a lot of freedom. The nice thing about making your graphics through code is that you won’t break anything you can’t reproduce. If something doesn’t work, you can figure out what happened, fix things, and re-run the code to make the graph again.

Second, remember that the main flow of action in ggplot is always the same. You start with a table of data, you map the variables you want to display to aesthetics like position, color, or shape, and you choose one or more geoms to draw the graph. In your code this gets accomplished by making an object with the basic information about data and mappings, and then adding or layering additional information as needed. Once you get used to this way of thinking about your plots, especially the aesthetic mapping part, then drawing them becomes easier. Instead of having to think about how to draw particular shapes or colors on the screen, the many geom_ functions take care of that for you. In the same way, learning new geoms is easier once you think of them as ways to display aesthetic mappings that you specify. Most of the learning curve with ggplot involves getting used to this way of thinking about your data and its representation in a plot. In the next chapter, we will flesh out these ideas a little more, cover some common ways plots go “wrong” (i.e., when they end up looking strange), and learn how to recognize and avoid those problems.
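
As a compact reminder of that flow, here is a sketch using the gapminder data from this chapter. The particular geoms and the log scale are just one possible combination, chosen to match the plots we made earlier.

```r
library(ggplot2)
library(gapminder)

# Start with a table of data and map variables to aesthetics
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Then layer on one or more geoms, plus any scale adjustments
p + geom_point() +
    geom_smooth(method = "loess") +
    scale_x_log10()
```

The object p holds only the data and the mappings. Nothing is drawn until at least one geom is added to it.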