Preface

You should look at your data. Graphs and charts let you explore and learn about the structure of the information you collect. Good data visualizations also make it easier to communicate your ideas and findings to other people. Beyond that, producing effective plots from your own data is the best way to develop a good eye for reading and understanding graphs—good and bad—made by others, whether presented in research articles, business slide decks, public policy advocacy, or media reports. This book teaches you how to do it.

My main goal is to introduce you to both the ideas and the methods of data visualization in a sensible, comprehensible, reproducible way. Some classic works on visualizing data, such as The Visual Display of Quantitative Information (Tufte 1983), present numerous examples of good and bad work together with some general taste-based rules of thumb for constructing and assessing graphs. In what has now become a large and thriving field of research, more recent work provides excellent discussions of the cognitive underpinnings of successful and unsuccessful graphics, again providing many compelling and illuminating examples (Ware 2008). Other books provide good advice about how to graph data under different circumstances (Cairo 2013; Few 2009; Munzer 2014), but choose not to teach the reader about the tools used to produce the graphics they show. This may be because the software used is some (proprietary, costly) point-and-click application that requires a fully visual introduction of its own, such as Tableau, Microsoft Excel, or SPSS. Or perhaps the necessary software is freely available, but showing how to use it is not what the book is about (Cleveland 1994). Conversely, there are excellent cookbooks that provide code “recipes” for many kinds of plot (Chang 2013). But for that reason they do not take the time to introduce the beginner to the principles behind the output they produce. Finally, we also have thorough introductions to particular software tools and libraries, including the one we will use in this book (Wickham 2016a). These can sometimes be hard for beginners to digest, as they may presuppose a background that the reader does not have.

Each of the books I have just cited is well worth your time. When teaching people how to make graphics with data, however, I have repeatedly found the need for an introduction that motivates and explains why you are doing something but that does not skip the necessary details of how to produce the images you see on the page. And so this book has two main aims. First, I want you get to the point where you can reproduce almost every figure in the text for yourself. Second, I want you to understand why the code is written the way it is, such that when you look at data of your own you can feel confident about your ability to get from a rough picture in your head to a high-quality graphic on your screen or page.

What you will Learn

This book is a hands-on introduction to the principles and practice of looking at and presenting data using R and ggplot. R is a powerful, widely used, and freely available programming language for data analysis. You may be interested in exploring ggplot after having used R before, or be entirely new to both R and ggplot and just want to graph your data. I do not assume you have any prior knowledge of R.

After installing the software we need, we begin with an overview of some basic principles of visualization. We focus not just on the aesthetic aspects of good plots, but on how their effectiveness is rooted in the way we perceive properties like length, absolute and relative size, orientation, shape, and color. We then learn how to produce and refine plots using ggplot2, a powerful, versatile, and widely-used visualization library for R (Wickham 2016a). The ggplot2 library implements a “grammar of graphics” (Wilkinson 2005). This approach gives us a coherent way to produce visualizations by expressing relationships between the attributes of data and their graphical representation.

Through a series of worked examples, you will learn how to build plots piece by piece, beginning with scatterplots and summaries of single variables, then moving on to more complex graphics. Topics covered include plotting continuous and categorical variables, layering information on graphics; faceting grouped data to produce effective “small multiple” plots; transforming data to easily produce visual summaries on the graph such as trend lines, linear fits, error ranges, and boxplots; creating maps, and also some alternatives to maps worth considering when presenting country- or state-level data. We will also cover cases where we are not working directly with a dataset, but rather with estimates from a statistical model. From there, we will explore the process of refining plots to accomplish common tasks such as highlighting key features of the data, labeling particular items of interest, annotating plots, and changing their overall appearance. Finally we will examine some strategies for presenting graphical results in different formats, and to different sorts of audiences.

If you follow the text and examples in this book, then by the end you will:

  • Understand the basic principles behind effective data visualization.
  • Have a practical sense for why some graphs and figures work well, while others may fail to inform or actively mislead.
  • Know how to create a wide range of plots in R using ggplot2.
  • Know how to refine plots for effective presentation.

Learning how to visualize data effectively is more than just knowing how to write code that produces figures from data. This book will teach you how to do that. But it will also teach you how to think about the information you want to show, and how to consider the audience you are showing it to—including the most common case, when the audience is yourself.

This book is not a comprehensive guide to R, or even a comprehensive survey of everything ggplot can do. Nor is it a cookbook containing just examples of specific things people commonly want to do with ggplot. (Both these sorts of books already exist: see the references in the Appendix.) Neither is it a detailed collection of rules, or a sequence of examples that you cannot reproduce. My goal is to get you up and running in R and making plots quickly and in a well-informed way, with a solid grasp of the core procedure—building up images layer by layer—that is at the heart of what ggplot does.

Learning ggplot does mean getting used to how R works, and also understanding how ggplot works with other tools in the language to get its job done. As you work your way through the book you will gradually learn more about some very useful idioms and functions in R for manipulating data. In particular you will learn about some of the tools provided by the tidyverse library that ggplot belongs to. Similarly, although this is not a cookbook, once you get past Chapter 2 you will see the code used to produce almost every figure in the book, and in most cases see the logic of these figures built up piece by piece. If you use the book as it is designed, by the end you will have the makings of a cookbook of your own, containing code you have written out yourself. And though we do not go into great depth on the topic of rules or principles of visualization, the discussion in Chapter 2 and its application throughout the book gives you more to think about than just a list of graph types. By the end of the book you should be able to look at a figure and be able to see it in terms of ggplot’s grammar, understanding how the various layers, shapes, and data are pieced together to make a finished plot.

The Right Frame of Mind

It can be a little disorienting to learn a programming language like R, mostly because at the beginning there seem to be so many pieces to fit together in order for things to work properly. The language has some possibly unfamiliar concepts that define how it works—things like “object”, or “function”, or “class”. The syntactic rules for writing code are annoyingly picky. Error messages seem obscure; help pages are terse; other people seem to have had not quite the same issue as you. Beyond that, you see that doing one thing often involves learning a bit about some other part of the language. To make a plot you need a table of data, but maybe you need to filter out some rows, recalculate some columns, or just get the computer to see it is there in the first place. And finally there is also a wider environment of supporting applications and tools that are good to know about as well—editors that highlight what you write; applications that help you organize your code and its output; ways of writing your code that let you keep track of what you have done and keep your notes close to their output. It can all seem a bit confusing.

Don’t panic. You have to start somewhere. Starting with graphics is more rewarding than some of the other places you might begin, because you will be able to see the results of your efforts very quickly. As you build your confidence and ability in this area, you will gradually see the other tools as things that help you sort out some issue, or solve a problem that’s stopping you from making the picture you want. That makes them easier to learn. As you acquire them piecemeal—perhaps initially using them without completely understanding what is happening—you will begin to see how they fit together, and be more confident of your own ability to do what you need to do.

Even better, in the past decade or so the world of data analysis and programming generally has opened up in a way that has made help much more easily available. Freely-available tools for coding have been around for a long time, but what you might call the “ecology of assistance” has gotten better. There are more resources available for learning the various pieces, and more of them are oriented to the way writing code actually happens most of the time—which is to say, iteratively, in an error-prone fashion, and with problems other people have likely run into and solved before.

Conventions

In this book, we alternate between regular text (like this), and samples of code that you can run yourself. Code you can type directly into R at the console will look like thisAdditional notes and information will sometimes appear in the margin, like this.:

## Inside chunks of code, lines beginning with the hash character are comments
my_numbers <- c(1, 1, 4, 1, 1, 4, 1)

When we write code that produces output at the console, it will look like this:

my_numbers

## [1] 1 1 4 1 1 4 1

In the main text, references to objects in R—like tables of data, variables, functions, and so on—will also appear in a monospaced typeface, like this: my_numbers.

Before you Begin

The book is designed for you to follow along in an active way, writing out the examples and experimenting with the code as you go. You will be able to reproduce almost all of the plots in the text. To do this, you will need to install some software first. Here is what to do:

  1. The most recent version of R. R is free and available for Windows, Mac, and Linux operating systems. Downloadcloud.r-project.org the version of R compatible with your operating system. If you are running Windows or MacOS, you should choose one of the precompiled binary distributions (i.e., ready-to-run applications) linked at the top of the R Project’s webpage.
  2. Oncerstudio.com R is installed, download and install R Studio. R Studio is an “Integrated Development Environment”, or IDE. This means it is a front-end for R that makes it much easier to work with. R Studio is also free, and available for Windows, Mac, and Linux platforms.
  3. Installtidyverse.org the tidyverse library and several other add-on packages for R. These libraries provide useful functionality that we will take advantage of throughout the book. You can learn more about the tidyverse’s family or packages at its website.

To install the tidyverse, make sure you have an Internet connection and then launch R Studio. Type the following lines of code and in to R’s command prompt, located in the window named “Console”. The <- arrow is made up of two keystrokes, first < and then the short dash or minus symbol, -.You could copy and paste it, if you like, but I strongly recommend typing all the code examples right from the beginning.

my_packages <- c("tidyverse", "broom", "coefplot", "cowplot",
                 "gapminder", "GGally", "ggrepel", "ggridges", "gridExtra",
                 "interplot", "margins", "maps", "mapproj", "mapdata",
                 "MASS", "quantreg", "scales", "survey", "srvyr",
                 "viridis", "viridisLite", "devtools")

install.packages(my_packages, repos = "http://cran.rstudio.com")

R Studio should then download and install these packages for you. It may take a little while to download everything.

With these packages available, you can then install one last library of material that’s useful specifically for this book. It is hosted on GitHub, rather than R’s central package repository, so we use a different function to fetch it.

devtools::install_github("kjhealy/socviz")

Once you’ve done that, we can get started.