A Appendix

This Appendix contains supplemental information about various aspects of R and ggplot that you are likely to run into as you use it. Think of it as the beginning of that process—sometimes annoying, but encountered by absolutely everyone—of discovering practical problems that are an inevitable part of using software, but whose solutions provide you, piece by piece, with more knowledge about what you are doing, and more confidence about how to tackle the next one that comes along.

A.1 The Basics of Accessing and Selecting Elements

Generally speaking, the tidyverse’s preferred methods for data subsetting, filtering, slicing and selecting will keep you away from the underlying mechanics of selecting and extracting elements of vectors, matrices, or tables of data. Carrying out these operations through functions like select(), filter(), subset(), and merge() is generally safer and more reliable than accessing elements directly. However, it is worth knowing the basics of these operations. Sometimes accessing elements directly is the most convenient thing to do. More importantly, we may use these techniques in small ways in our code with some regularity. This Appendix provides the briefest of introductions to R’s selection operators for vectors, arrays, and tables.

Consider the my_numbers and your_numbers vectors again.

my_numbers <- c(1, 2, 3, 1, 3, 5, 25)
your_numbers <- c(5, 31, 71, 1, 3, 21, 6)

To access any particular element in my_numbers, we use square brackets. Square brackets are not like the parentheses after functions. They are used to pick out an element indexed by its position:

my_numbers[4]

## [1] 1

my_numbers[7]

## [1] 25

Putting the number n inside the brackets will give us (or “return”) the nth element in the vector, assuming there is one. To access a sequence of elements within a vector we can do this:

my_numbers[2:4]

## [1] 2 3 1

This shorthand notation tells R to count from the 2nd to the 4th element, inclusive. We are not restricted to selecting contiguous elements, either. We can make use of our c() function again:

my_numbers[c(2, 4)]

## [1] 2 1

R evaluates the expression c(2, 4) first, and then extracts just the second and fourth elements from my_numbers, ignoring the others. You might wonder why we didn’t just write my_numbers[2,4] directly. The answer is that this notation is used for objects arrayed in two dimensions, like matrices or data frames. We can make a two-dimensional object by creating two different vectors with the c() function and using the data.frame() function to collect them together:

my_df <- data.frame(
    mine = c(1,4,5, 8:11),
    yours = c(3,20,16, 34:31))

class(my_df)

## [1] "data.frame"

my_df

##   mine yours
## 1    1     3
## 2    4    20
## 3    5    16
## 4    8    34
## 5    9    33
## 6   10    32
## 7   11    31

We index data frames and other arrays by row first, and then by column:

my_df[3,1] # Row 3 Col 1

## [1] 5

my_df[1,2] # Row 1, Col 2 

## [1] 3

However, because our columns have names, we can access them through those as well. We do this by putting the name of the column in quotes where we previously put the index number of the column:

my_df[3,"mine"] # Row 3 Col 1

## [1] 5

my_df[1,"yours"] # Row 1, Col 2 

## [1] 3

If we want all of the elements of a particular column, we can leave out the row index. This means all of that column’s rows will be included:

my_df[,"mine"] # All rows, Col 1

## [1]  1  4  5  8  9 10 11

We can do this the other way around, too:

my_df[4,] # Row 4, all cols

##   mine yours
## 4    8    34

A better way of accessing particular columns in a data frame is via the $ operator, which we use to extract components of objects. This way we just append the name of the column we want to the object it is inside:

my_df$mine

## [1]  1  4  5  8  9 10 11

Elements of many other objects can be extracted in this way, too, including nested objects.

out <- lm(mine ~ yours, data = my_df)

out$coefficients

## (Intercept)       yours 
##  -0.0801192   0.2873422

out$call

## lm(formula = mine ~ yours, data = my_df)

out$qr$rank # nested 

## [1] 2

Finally, in the case of data frames the $ operator lets us easily add new columns to the object. For example, we can add the first two columns together, row by row:

my_df$ours <- my_df$mine + my_df$yours
my_df

##   mine yours ours
## 1    1     3    4
## 2    4    20   24
## 3    5    16   21
## 4    8    34   42
## 5    9    33   42
## 6   10    32   42
## 7   11    31   42

In this book we do not generally access data via [ or $. It is particularly bad practice to access elements by their index number only, as opposed to using names. It is too easy to make a mistake in this way. But we use the c() function for small tasks quite regularly, so it’s worth understanding how it can be used to pick out elements from vectors.
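
For example, if you do pull out a column or two directly, selecting them by name rather than by position makes your intent clear and will keep working even if the columns are later reordered. A small illustration using the my_df object from above:

my_df[, c("mine", "yours")]  # select columns by name
my_df[, c(1, 2)]             # by position; fragile if the columns move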

A.2 Tidy your Data

Working with R and ggplot is much easier if the data you use is in the right shape: ggplot wants your data to be tidy. For a more thorough introduction to the idea of tidy data, see Chapters 5 and 12 of Wickham and Grolemund (2016). To get a sense of what a tidy dataset looks like in R, we will follow the discussion in Wickham (2014). In a tidy dataset,

  1. Each variable is a column.
  2. Each observation is a row.
  3. Each type of observational unit forms a table.

For most of your data analysis, the first two points are the most important. This third one might be a little unfamiliar. It is a feature of “normalized” data from the world of databases, where the goal is to represent data in a series of related tables with minimal duplication (Codd 1990). Data analysis more usually works with a single large table of data, often with considerable duplication of some variables down the rows.

Data presented in summary tables is often not “tidy” as defined here. When structuring our data, we need to be clear about how it is arranged. If your data is not tidily arranged, the chances are good that you will have more difficulty (maybe a lot more difficulty) getting ggplot to draw the figure you want.

Table A.1: Some untidy data.

name          treatmenta   treatmentb
John Smith    NA           18
Jane Doe      4            1
Mary Johnson  6            7

Table A.2: The same data, still untidy, but in a different way.

treatment   John Smith   Jane Doe   Mary Johnson
a           NA           4          6
b           18           1          7

For example, consider Table A.1 and Table A.2 from Wickham’s discussion. They present the same data in different ways, but each would cause trouble if we tried to work with it in ggplot to make a graph. Table A.3 shows the same data once again, this time in a tidied form.

Hadley Wickham notes five main ways tables of data tend not to be tidy:

  1. Column headers are values, not variable names.
  2. Multiple variables are stored in one column.
  3. Variables are stored in both rows and columns.
  4. Multiple types of observational units are stored in the same table.
  5. A single observational unit is stored in multiple tables.

Table A.3: Tidied data. Every variable a column, every observation a row.

name          treatment   n
Jane Doe      a           4
Jane Doe      b           1
John Smith    a           NA
John Smith    b           18
Mary Johnson  a           6
Mary Johnson  b           7

Data comes in an untidy form all the time, often for the good reason that it can be presented that way using much less space, or with far less repetition of labels and row elements. Figure A.1 shows the first few rows of a table of U.S. Census Bureau data about educational attainment in the United States. To begin with, it’s organized as a series of sub-tables down the spreadsheet, broken out by age and sex. Second, the underlying variable of interest, “Years of School Completed,” is stored across several columns, with an additional variable (level of schooling) included across the columns as well. Getting the table into a slightly more regular format (eliminating the blank rows and explicitly naming the sub-table rows) is not that difficult, to the point where we can read it in as an Excel or CSV file.

Figure A.1: Untidy data from the Census.

edu

## # A tibble: 366 x 11
##      age   sex  year total elem4 elem8   hs3   hs4 coll3 coll4 median
##    <chr> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1 25-34  Male  2016 21845   116   468  1427  6386  6015  7432     NA
##  2 25-34  Male  2015 21427   166   488  1584  6198  5920  7071     NA
##  3 25-34  Male  2014 21217   151   512  1611  6323  5910  6710     NA
##  4 25-34  Male  2013 20816   161   582  1747  6058  5749  6519     NA
##  5 25-34  Male  2012 20464   161   579  1707  6127  5619  6270     NA
##  6 25-34  Male  2011 20985   190   657  1791  6444  5750  6151     NA
##  7 25-34  Male  2010 20689   186   641  1866  6458  5587  5951     NA
##  8 25-34  Male  2009 20440   184   695  1806  6495  5508  5752     NA
##  9 25-34  Male  2008 20210   172   714  1874  6356  5277  5816     NA
## 10 25-34  Male  2007 20024   246   757  1930  6361  5137  5593     NA
## # ... with 356 more rows

The tidyverse has several tools to help you get the rest of the way in converting your data from an untidy to a tidy state. These can mostly be found in the tidyr and dplyr libraries. The former provides functions for converting, for example, wide-format data to long-format data, as well as assisting with the business of splitting and combining variables that are untidily stored. The latter has tools that allow tidy tables to be further filtered, sliced, and analyzed at different grouping levels, as we have seen throughout this book.

With our edu object, we can use the gather() function to transform the schooling variables into a key-value arrangement. The key is the underlying variable, and the value is the value it takes for that observation. We create a new object, edu_t, in this way.

edu_t <- gather(data = edu,
                key = school,
                value = freq,
                elem4:coll4)

head(edu_t) 

## # A tibble: 6 x 7
##     age   sex  year total median school  freq
##   <chr> <chr> <int> <int>  <dbl>  <chr> <dbl>
## 1 25-34  Male  2016 21845     NA  elem4   116
## 2 25-34  Male  2015 21427     NA  elem4   166
## 3 25-34  Male  2014 21217     NA  elem4   151
## 4 25-34  Male  2013 20816     NA  elem4   161
## 5 25-34  Male  2012 20464     NA  elem4   161
## 6 25-34  Male  2011 20985     NA  elem4   190

tail(edu_t) 

## # A tibble: 6 x 7
##     age    sex  year total median school  freq
##   <chr>  <chr> <int> <int>  <dbl>  <chr> <dbl>
## 1   55> Female  1959 16263    8.3  coll4   688
## 2   55> Female  1957 15581    8.2  coll4   630
## 3   55> Female  1952 13662    7.9  coll4   628
## 4   55> Female  1950 13150    8.4  coll4   436
## 5   55> Female  1947 11810    7.6  coll4   343
## 6   55> Female  1940  9777    8.3  coll4   219

The educational categories previously spread over the columns have been gathered into two new columns. The school variable is the key column. It contains all of the education categories that were previously given across the column headers, from 0-4 years of elementary school to four or more years of college. They are now stacked up on top of each other in the rows. The freq variable is the value column, and contains the frequency count corresponding to each level of school for every observation.
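
Incidentally, in more recent versions of the tidyr library the gather() function has been superseded by pivot_longer(). A sketch of the equivalent call, assuming you have a current version of tidyr installed:

edu_t <- pivot_longer(data = edu,
                      cols = elem4:coll4,
                      names_to = "school",
                      values_to = "freq")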

Once our data is in this long-form shape, it is ready for easy use with ggplot and related tidyverse tools. The Gapminder data, which is used throughout the book, is in this format, too:

library(gapminder)
gapminder

## # A tibble: 1,704 x 6
##        country continent  year lifeExp      pop gdpPercap
##         <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan      Asia  1952    28.8  8425333       779
##  2 Afghanistan      Asia  1957    30.3  9240934       821
##  3 Afghanistan      Asia  1962    32.0 10267083       853
##  4 Afghanistan      Asia  1967    34.0 11537966       836
##  5 Afghanistan      Asia  1972    36.1 13079460       740
##  6 Afghanistan      Asia  1977    38.4 14880372       786
##  7 Afghanistan      Asia  1982    39.9 12881816       978
##  8 Afghanistan      Asia  1987    40.8 13867957       852
##  9 Afghanistan      Asia  1992    41.7 16317921       649
## 10 Afghanistan      Asia  1997    41.8 22227415       635
## # ... with 1,694 more rows

A.3 Common Problems Reading in Data

Date formats

Date formats can be annoying. First, times and dates must be treated differently from ordinary numbers. Second, there are many, many different date formats, differing both in the precision with which they are stored and the convention they follow about how to display years, months, days, and so on. Consider the following data:

head(bad_date)

## # A tibble: 6 x 2
##     date     N
##    <chr> <int>
## 1 9/1/11 44426
## 2 9/2/11 55112
## 3 9/3/11 19263
## 4 9/4/11 12330
## 5 9/5/11  8534
## 6 9/6/11 59490

The data in the date column has been read in as a character string, but we want R to treat it as a date. If R can’t treat it as a date, we get bad results.

Figure A.2: A bad date.

p <- ggplot(data = bad_date, aes(x = date, y = N))
p + geom_line()

## geom_path: Each group consists of only one observation.
## Do you need to adjust the group aesthetic?

What has happened? The problem is that ggplot doesn’t know the date column consists of dates. As a result, when we ask it to plot date on the x-axis, it tries to treat the unique elements of date like a categorical variable instead (that is, as a factor). But because each date is unique, its default effort at grouping the data results in every group having only one observation in it (i.e., that particular row). The ggplot function knows something is odd about this and tries to let you know. It wonders whether we’ve failed to set group = <something> in our mapping.
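
We can confirm the issue by checking the class of the column directly:

class(bad_date$date)

## [1] "character"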

For the sake of it, let’s see what happens when the bad date values are not unique. We will make a new data frame by stacking two copies of the data on top of each other. The rbind() function does this for us. We end up with two copies of every observation.

Figure A.3: Still bad.

bad_date2 <- rbind(bad_date, bad_date)

p <- ggplot(data = bad_date2, aes(x = date, y = N))
p + geom_line()

Now ggplot doesn’t complain at all, because there’s more than one observation per (inferred) group. But the plot is still wrong!

We will fix this problem using the lubridate library. It provides a suite of convenience functions for converting date strings in various formats and with various separators (such as / or - and so on) into objects of class Date that R knows about. Here our bad dates are in a month/day/year format, so we use mdy(). Consult the lubridate library’s documentation to learn more about similar convenience functions for converting character strings where the date components appear in a different order.
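
For instance, dmy() and ymd() handle day-first and year-first strings. A quick illustration, assuming lubridate is loaded:

dmy("1/9/2011")

## [1] "2011-09-01"

ymd("2011-09-01")

## [1] "2011-09-01"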

# install.packages("lubridate")
library(lubridate)

bad_date$date <- mdy(bad_date$date)
head(bad_date)

## # A tibble: 6 x 2
##         date     N
##       <date> <int>
## 1 2011-09-01 44426
## 2 2011-09-02 55112
## 3 2011-09-03 19263
## 4 2011-09-04 12330
## 5 2011-09-05  8534
## 6 2011-09-06 59490

Now our date variable has class Date. Let’s try the plot again.

Figure A.4: Much better.

p <- ggplot(data = bad_date, aes(x = date, y = N))
p + geom_line()

Year-only dates

Many variables are measured by the year and supplied in the data as a four-digit number rather than as a date. This can sometimes cause headaches when we want to plot year on the x-axis. It happens most often when the time series is relatively short. Consider this data:

url <- "https://cdn.rawgit.com/kjhealy/viz-organdata/master/organdonation.csv"
bad_year <- read_csv(url)

bad_year %>% select(1:3) %>% sample_n(10)

## # A tibble: 10 x 3
##      country  year donors
##        <chr> <int>  <dbl>
##  1   Belgium  1994  22.80
##  2   Germany  1998  13.40
##  3   Finland  1998  19.80
##  4    Sweden    NA     NA
##  5   Finland  1992  19.40
##  6   Germany  2000  12.50
##  7   Finland  2001  17.00
##  8 Australia  1992  12.35
##  9     Italy  1997  11.60
## 10    Sweden  1992  14.90

It’s a version of organdata, but in a less clean format. Notice that the year variable is an integer (its class is <int>) and not a date. Let’s say we want to plot one of the measures, such as population, against year.

Figure A.5: Integer year shown with a decimal point.

p <- ggplot(data = bad_year, aes(x = year, y = pop))
p + geom_point()

## Warning: Removed 34 rows containing missing values
## (geom_point).

The decimal point on the x-axis labels is unwanted. We could sort this out cosmetically, by giving scale_x_continuous() a set of breaks and labels that represent the years as characters. Alternatively, we can change the class of the year variable. For convenience, we will tell R that the year variable should be treated as a Date measure, and not an integer. We’ll use a home-cooked function, int_to_year(), that takes integers and converts them to dates.
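
This helper is not part of base R. Here is a minimal sketch of one way to write such a function (an illustration only; the version used here may differ in its details):

int_to_year <- function(x, month = 1, day = 1) {
    ## Build an ISO-style date string from the integer year and a
    ## nominal month and day, then convert it; NA years stay NA
    out <- rep(as.Date(NA), length(x))
    ok <- !is.na(x)
    out[ok] <- as.Date(sprintf("%d-%02d-%02d", x[ok], month, day))
    out
}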

bad_year$year <- int_to_year(bad_year$year)
bad_year %>% select(1:3)

## # A tibble: 238 x 3
##      country       year donors
##        <chr>     <date>  <dbl>
##  1 Australia         NA     NA
##  2 Australia 1991-01-01  12.09
##  3 Australia 1992-01-01  12.35
##  4 Australia 1993-01-01  12.51
##  5 Australia 1994-01-01  10.25
##  6 Australia 1995-01-01  10.18
##  7 Australia 1996-01-01  10.59
##  8 Australia 1997-01-01  10.26
##  9 Australia 1998-01-01  10.48
## 10 Australia 1999-01-01   8.67
## # ... with 228 more rows

In the process, a nominal day and month (January 1st) are introduced into the year data, but that is irrelevant for our purposes, given that our data are only observed in a yearly window to begin with, and the specific day and month are immaterial.

A.4 Write Functions for Repetitive Tasks

If you are working with a data set that you will be making a lot of similar plots from, or will need to periodically look at in a way that is repetitive but can’t be carried out in a single step once and for all, then the chances are that you will start accumulating sequences of code that you find yourself using repeatedly. When this happens, the temptation will be to start copying and pasting these sequences from one analysis to the next. We can see something of this tendency in the code samples for this book. To make the exposition clearer, we have periodically repeated chunks of code that differ only in the dependent or independent variable being plotted.

You should try to avoid copying and pasting code repeatedly in this way. Instead, this is an opportunity to write a function to help you out a little. More or less everything in R is accomplished through functions, and it’s not too difficult to write your own. This is especially the case when you begin by thinking of functions as a way to help you automate some local or smaller task, rather than a means of accomplishing some very complex task. R has the resources to help you build complex functions and function libraries—like ggplot itself—but we can start quite small, with functions that help us manage a particular dataset or data analysis.

Here, for instance, is a function that will make a plot of membership over time for any Section in the ASA data, or optionally fit a smoother to the data and plot that instead. Defining a function looks a little like calling one, except that we spell out the steps inside. Notice how we also specify the default arguments.

plot.section <- function(section="Culture", x = "Year",
                         y = "Members", data = asasec,
                         smooth=FALSE){
    require(ggplot2)
    require(splines)
    ## Note use of aes_string() rather than aes() 
    p <- ggplot(subset(data, Sname==section),
            mapping = aes_string(x=x, y=y))

    if(smooth == TRUE) {
        p0 <- p + geom_smooth(color = "#999999",
                              size = 1.2, method = "lm",
                              formula = y ~ ns(x, 3)) +
            scale_x_continuous(breaks = c(seq(2005, 2015, 4))) +
            labs(title = section)
    } else {
        p0 <- p + geom_line(color = "#E69F00", size = 1.2) +
            scale_x_continuous(breaks = c(seq(2005, 2015, 4))) +
            labs(title = section)
    }

    print(p0)
}

This function is not very general. Nor is it particularly robust. But for the use we want to put it to, it works just fine.

plot.section("Rationality")
plot.section("Sexualities", smooth = TRUE)

Figure A.6: Using a function to plot your results.

If we were going to work with this data for long enough, we could make the function progressively more general. For example, we can add the special ... argument (which means, roughly, “and any other named arguments”) in a way that allows us to pass arguments through to the geom_smooth() function in the way we’d expect if we were using it directly. With that in place, we can pick the smoothing method we want.

plot.section <- function(section="Culture", x = "Year",
                         y = "Members", data = asasec,
                         smooth=FALSE, ...){
    require(ggplot2)
    require(splines)
    ## Note use of aes_string() rather than aes() 
    p <- ggplot(subset(data, Sname==section),
            mapping = aes_string(x=x, y=y))

    if(smooth == TRUE) {
        p0 <- p + geom_smooth(color = "#999999",
                              size = 1.2, ...) +
            scale_x_continuous(breaks = c(seq(2005, 2015, 4))) +
            labs(title = section)
    } else {
        p0 <- p + geom_line(color = "#E69F00", size = 1.2) +
            scale_x_continuous(breaks = c(seq(2005, 2015, 4))) +
            labs(title = section)
    }

    print(p0)
}

plot.section("Comm/Urban",
             smooth = TRUE,
             method = "loess")
plot.section("Children",
             smooth = TRUE,
             method = "lm",
             formula = y ~ ns(x, 2))

Figure A.7: Our custom function can now pass arguments along to fit different smoothers to Section membership data.

A.5 How to Save your Work

When you’re working interactively with R and ggplot, you can see the results of your plots right away via the graphics device that is displaying output on the screen for you. Similarly, if you’re writing your code in an .Rmd file, the plots you make will be reproduced in there. (The PDF version of this book is in effect one large such file.) This keeps us away from manipulating or adjusting plots directly on the screen, via our mouse or trackpad. That’s good, because work of that sort is very hard to reproduce, and you will forget how you accomplished the end result you got. It’s much better, and more robust, to do as much as you can programmatically, through code, from start to finish.

You will, however, very often need to save your figures individually. They will end up being dropped into slides or published in papers, and ultimately become detached from the code that produced them. This is OK, as long as you still have that code to hand in order to do it again.

Saving a figure to a file can be done in several different ways. When working with ggplot, the easiest way is to use the ggsave() function. To save the most recently displayed figure:

ggsave("my-figure.png")

Or, for a PDF:

ggsave("my-figure.pdf")

You can also pass plot objects to ggsave(), like this:

ggsave("my-figure.pdf", plot = p5)

When saving your work, as we mentioned at the very beginning, it’s sensible to have a subfolder (or more than one, depending on the project) where you save only figures. You should also take care to name your saved figures in a sensible way. “Fig 3” or “My Figure” are not good names. Figure names should be compact but descriptive, consistent between figures within a project, and (this really shouldn’t be the case in this day and age, but it is) free of characters likely to make your code choke in the future. These include apostrophes, backticks, spaces, forward and back slashes, and quotes.

## Assuming 'figures/' exists in the working directory
ggsave("figures/gapminder-lifexp-vs-gdp-smoothed.pdf", plot = p5)

You can save your figure in a variety of different formats, depending on your needs (and also, to a lesser extent, on your particular computer system). The most important distinction to bear in mind is between vector formats and raster formats. A vector format, like PDF or SVG, stores the figure as a set of instructions about lines, shapes, colors, and their relationships. The viewing software (such as Adobe Acrobat or Apple’s Preview application for PDFs) then interprets those instructions and displays the figure. Representing the figure this way allows it to be easily resized without becoming distorted. The PDF format is a descendant of PostScript, the language of modern typesetting and printing. This makes a vector-based format like PDF the best choice for submission to journals.

A raster-based format, on the other hand, stores images essentially as a grid of pixels of a pre-defined size, with information about the location, hue, brightness, and so on of each pixel in the grid. This makes for a more efficient mode of storage, especially when used in conjunction with compression methods that take advantage of redundancy in images in order to save space. Formats like JPG are compressed raster formats. A PNG file is a raster image format that supports lossless compression. For graphs containing a great deal of data, PNG files will tend to be much smaller than the corresponding PDF. However, raster formats cannot be easily resized, and especially expanded in resolution, without becoming pixelated or grainy. Formats like JPG and PNG are the standard way that images are displayed on the web, although the SVG format (a vector-based format that is nevertheless supported by browsers) is slowly becoming more common. ggsave() supports SVG as well.

In general you should happily save your work in several different formats. Note that when you save in different formats and in different sizes you may need to experiment with the scaling of the plot and the size of the fonts in order to get a good result. The scale argument to ggsave() can help you here (you can try out different values, like scale=1.3, scale=5, etc).
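
For example, to save a specific plot object at particular dimensions, you can pass the size directly to ggsave(). This is a sketch; the file names and dimensions are just illustrative:

ggsave("figures/lifexp-vs-gdp.pdf", plot = p5,
       width = 8, height = 5, units = "in")

ggsave("figures/lifexp-vs-gdp.png", plot = p5,
       width = 8, height = 5, units = "in", dpi = 300)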

A.6 RMarkdown and knitr

Markdown (en.wikipedia.org/wiki/Markdown) is a loosely standardized way of writing plain text that includes information about the formatting of your document. It was originally developed by John Gruber, with input from Aaron Swartz. The aim was to make a simple format that could incorporate some structural information about the document (such as headings and subheadings, emphasis, hyperlinks, lists, footnotes, and so on), with minimal loss of readability in plain-text form. A markup format like HTML is much more extensive and well-defined than Markdown, but Markdown was meant to be simple. Over the years, and despite various weaknesses, it has become a de facto standard. Text editors and note-taking applications support it, and tools exist to convert Markdown not just into HTML (its original target output format) but into many other document types as well. The most powerful of these is Pandoc (pandoc.org), which can get you from Markdown to many other formats (and vice versa). Pandoc is what powers RStudio’s ability to convert your notes to HTML, Microsoft Word, and PDF documents.

Chapter 1 of this book encourages you to take notes and organize your analysis using RMarkdown (rmarkdown.rstudio.com) and, behind the scenes, knitr (yihui.name/knitr). These are R libraries that RStudio makes easy to use. RMarkdown extends Markdown by letting you intersperse your notes with chunks of R code. Code chunks can have labels and a few options that determine how they will behave when the file is processed. After writing your notes and your code, you knit the document (Xie 2015). That is, you feed your .Rmd file to R, which processes the code chunks and produces a new .md file where the code chunks have been replaced by their output. You can then turn that Markdown file into a more readable PDF or HTML document, or the Word document that a journal demands you send them.
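
To make this concrete, a labeled chunk in an .Rmd file looks like the following. This is a sketch: the chunk label, the fig.cap option, and the plot code are just for illustration, and it assumes ggplot2 and the gapminder data have been loaded in an earlier chunk.

```{r lifeexp-by-year, fig.cap = "Life expectancy over time."}
p <- ggplot(data = gapminder,
            mapping = aes(x = year, y = lifeExp))
p + geom_line(aes(group = country))
```

When the document is knit, this chunk is replaced in the output by the figure it produces, along with the code itself unless you set echo = FALSE in the chunk options.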

Behind the scenes in RStudio, this is all done using the knitr and rmarkdown libraries. The latter provides a render() function that takes you from .Rmd to HTML or PDF in a single step. Conversely, if you just want to extract the code you’ve written from the surrounding text, then you “tangle” the file, which results in an .R file. The strength of this approach is that it makes it much easier to document your work properly. There is just one file for both the data analysis and the writeup. The output of the analysis is created on the fly, and the code to do it is embedded in the paper. If you need to do multiple but identical (or very similar) analyses of different bits of data, RMarkdown and knitr can make generating consistent and reliable reports much easier.
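
You can call render() yourself from the console, and knitr’s purl() function does the tangling. A quick sketch, assuming your notes are in a file called notes.Rmd:

# Produce a rendered document (HTML, PDF, Word, ...) from the .Rmd file
rmarkdown::render("notes.Rmd")

# "Tangle": extract just the code chunks into notes.R
knitr::purl("notes.Rmd")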

Pandoc’s flavor of Markdown—the one used in knitr and RStudio—allows for a wide range of markup, and can handle many of the nuts and bolts of scholarly writing, such as complex tables, citations, bibliographies, references, and mathematics. In addition to being able to produce documents in various file formats, it can also produce many different kinds of document, from articles and handouts to websites and slide decks. RStudio’s RMarkdown website has extensive documentation and examples on the ins and outs of RMarkdown’s capabilities, including information on customizing it if you wish.

Writing your notes and papers in a plain text format like this has many advantages. It keeps your writing, your code, and your results closer together, and allows you to use powerful version control methods to keep track of your work and your results. Errors in data analysis often well up out of the gap that typically exists between the procedure used to produce a figure or table in a paper and the subsequent use of that output later. In the ordinary way of doing things, you have the code for your data analysis in one file, the output it produced in another, and the text of your paper in a third file. You do the analysis, collect the output and copy the relevant results into your paper, often manually reformatting them on the way. Each of these transitions introduces the opportunity for error. In particular, it is easy for a table of results to get detached from the sequence of steps that produced it. Almost everyone who has written a quantitative paper has been confronted with the problem of reading an old draft containing results or figures that need to be revisited or reproduced (as a result of peer-review, say) but which lack any information about the circumstances of their creation. Academic papers take a long time to get through the cycle of writing, review, revision, and publication, even when you’re working hard the whole time. It is not uncommon to have to return to something you did two years previously in order to answer some question or other from a reviewer. You do not want to have to do everything over from scratch in order to get the right answer. I am not exaggerating when I say that, whatever the challenges of replicating the results of someone else’s quantitative analysis, after a fairly short period of time authors themselves find it hard to replicate their own work. Bit-rot is the term of art in Computer Science for the seemingly inevitable process of decay that overtakes a project just because you left it alone on your computer for six months or more.

For small and medium-sized projects, plain text approaches that rely on RMarkdown documents and the tools described here work well. Things become a little more complicated as projects get larger. (This is not an intrinsic flaw of plain-text methods, by the way. It is true no matter how you choose to organize your project.) In general, it is worth trying to keep your notes and analysis in as standardized and simple a format as you can, for as long as you can. The final outputs of projects (such as journal articles or books) tend, as they approach completion, to descend into a rush of specific fixes and adjustments, all running against the ideal of a fully portable, reproducible analysis. It is best to put off that phase for as long as you can.

A.7 Preparing the County-Level Maps

The U.S. county-level maps in the socviz library were prepared using shapefiles from the U.S. Census Bureau that were converted to GeoJSON format by Eric Celeste (eric.clst.org/Stuff/USGeoJSON). The code to prepare the imported shapefile was written by Bob Rudis, and draws on the rgdal library to do the heavy lifting of importing the shapefile and transforming the projection. Bob’s code extracts the (county-identifying) rownames from the imported spatial data frame, and then moves Alaska and Hawaii to new locations in the bottom left of the map area, so that we can map all fifty states instead of just the lower forty-eight.

First we read in the file, set the projection, and set up an identifying variable we can work with later on to merge in data. Note that the CRS() call contains a single long line of text. The \ character at the end of a line conventionally indicates that it continues on the next one. Do not type the backslashes if you write out this code yourself; the projection string should be one continuous line.

us.counties <- readOGR(dsn="data/geojson/gz_2010_us_050_00_5m.json",
                       layer="OGRGeoJSON")

us.counties.aea <- spTransform(us.counties,
                    CRS("+proj=laea +lat_0=45 +lon_0=-100 \
                         +x_0=0 +y_0=0 +a=6370997 +b=6370997 \
                         +units=m +no_defs"))

us.counties.aea@data$id <- rownames(us.counties.aea@data)

Then we extract, rotate, shrink and move Alaska, resetting the projection in the process, and also move Hawaii. The areas are identified by their State FIPS codes. We remove the old states and put the new ones back in, and remove Puerto Rico as our examples lack data for this region. If you have data for the area, you can move it between Texas and Florida.

alaska <- us.counties.aea[us.counties.aea$STATE=="02",]
alaska <- elide(alaska, rotate=-50)
alaska <- elide(alaska, scale=max(apply(bbox(alaska), 1, diff)) / 2.3)
alaska <- elide(alaska, shift=c(-2100000, -2500000))
proj4string(alaska) <- proj4string(us.counties.aea)

hawaii <- us.counties.aea[us.counties.aea$STATE=="15",]
hawaii <- elide(hawaii, rotate=-35)
hawaii <- elide(hawaii, shift=c(5400000, -1400000))
proj4string(hawaii) <- proj4string(us.counties.aea)

us.counties.aea <- us.counties.aea[!us.counties.aea$STATE %in% c("02", "15", "72"),]
us.counties.aea <- rbind(us.counties.aea, alaska, hawaii)

A.8 The Theme Used in this Book

The ggplot theme used in this book is derived principally from the work (again) of Bob Rudis. His hrbrthemes package provides theme_ipsum(), a compact theme that can be used with the Arial Narrow typeface or, in a variant, the freely available Roboto Condensed typeface. The main difference between the theme_book() used here and Rudis’s theme_ipsum() is the choice of typeface. The hrbrthemes package can be installed from GitHub in the usual way:

devtools::install_github("hrbrmstr/hrbrthemes")

The book theme is also available on GitHub (github.com/kjhealy/myriad). Note that it does not include the font files themselves. These are available from Adobe, who make the typeface.

When drawing maps we also used a theme_map() function. This theme begins with the built-in theme_bw() and turns off most of the guide, scale, and panel content that is not needed when presenting a map. The code looks like this:

theme_map <- function(base_size=9, base_family="") {
    require(grid)
    theme_bw(base_size=base_size, base_family=base_family) %+replace%
        theme(axis.line=element_blank(),
              axis.text=element_blank(),
              axis.ticks=element_blank(),
              axis.title=element_blank(),
              panel.background=element_blank(),
              panel.border=element_blank(),
              panel.grid=element_blank(),
              panel.spacing=unit(0, "lines"),
              plot.background=element_blank(),
              legend.justification = c(0,0),
              legend.position = c(0,0)
              )
}

Themes are functions, so to create a theme is to create a function. We give it a default base_size argument and an empty base_family argument (for the font family). Notice the %+replace% operator in the code. This is a convenience operator defined by ggplot and used for updating theme elements in bulk. Throughout the book we saw repeated use of the + operator to incrementally add to or tweak the content of a theme, as when we would do + theme(legend.position = "top"). Using + added the instruction to the theme, adjusting whatever was specified and leaving everything else as was. The %+replace% operator serves a similar function, but has a stronger effect. We begin with theme_bw() and then use a theme() statement to add new content, as usual. The %+replace% operator replaces the entire element specified, rather than adding to it. Any element not specified in the theme() statement will be deleted from the new theme. So this is a way to create themes by both starting from existing ones, specifying new elements, and deleting anything not explicitly mentioned. See the documentation for theme_get() for more details.

A.9 Where to Learn More

Learning ggplot should encourage you to learn more about the set of tidyverse tools, and then by extension to learn more about R. Here are some books and other resources you may find of use as you learn more about data visualization, R, and ggplot. Note that several of the books are available online in their entirety, either in PDF form (e.g. via LeanPub, or at Springer’s SpringerLink) or as websites.

Books

First, there are books closely related to the R and especially the tidyverse tools we have used:

  • Garrett Grolemund and Hadley Wickham. 2016. R for Data Science. O’Reilly. (r4ds.had.co.nz)
  • Winston S. Chang. 2013. The R Graphics Cookbook. O’Reilly.
  • Roger Peng. 2016. R Programming for Data Science. (leanpub.com/rprogramming)
  • Hadley Wickham. 2016. ggplot2: Elegant Graphics for Data Analysis. Second edition. Springer.

Second, material that covers more of R’s capabilities and applications:

  • Chris Brunsdon and Lex Comber. 2015. An Introduction to R for Spatial Analysis and Mapping. Sage.
  • Peter W. Dalgaard. 2008. Introductory Statistics with R. 2nd Ed. Springer.
  • Tilman M. Davies. 2016. The Book of R. No Starch Press.
  • Michael Friendly and David Meyer. 2017. Discrete Data Analysis with R. CRC/Chapman and Hall.
  • Frank E. Harrell, Jr. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd Ed. Springer.
  • Kosuke Imai. 2017. Quantitative Social Science: An introduction. Princeton.
  • W.N. Venables and B.D. Ripley. 2002. Modern Applied Statistics with S. 4th Ed. Springer.

And third, some useful and important work on data visualization:

  • Jacques Bertin. 2010 [1967]. Semiology of Graphics. Esri Press.
  • William S. Cleveland. 1993. Visualizing Data. Hobart Press.
  • William S. Cleveland. 1994. The Elements of Graphing Data. Revised Edition. Hobart Press.
  • Stephen Few. 2009. Now You See It: Simple Visualization Techniques for Quantitative Analysis. Analytics Press.
  • Tamara Munzner. 2014. Visualization Analysis & Design. CRC Press.
  • Edward R. Tufte. 1983. The Visual Display of Quantitative Information. Graphics Press.
  • Colin Ware. 2008. Visual Thinking for Design. Morgan Kaufmann.

Stack Sites

  • Stack Overflow (stackoverflow.com). This is a programming and developer Question and Answer site. In effect, it is a huge database of real problems and solutions, where both the questions and answers are posed by users of the site. It is maintained in a way that tries to maximize the usefulness of both questions and answers by allowing contributors to build their reputations on the basis of contributions that other users have found effective. All kinds of programming languages are covered, but searching with keywords, and adding tags enclosed in square brackets, e.g. [ggplot] or [git], will restrict results to the library or language you want answers in. Very often, straightforward Google searches of errors or problem descriptions will lead to a highly rated answer on Stack Overflow. In conjunction with more systematic guides or texts, it is a good place to begin learning more about the problems you run into.