6 Work with Models

Data visualization is about more than generating figures that display the raw numbers from a table of data. Right from the beginning, it involves producing figures that condense or summarize parts of the data, the better to represent trends, distributions, or comparisons.

An OLS vs robust regression comparison; a spline fit; and an additive quantile regression.An OLS vs robust regression comparison; a spline fit; and an additive quantile regression.An OLS vs robust regression comparison; a spline fit; and an additive quantile regression. Figure 6.1: An OLS vs robust regression comparison; a spline fit; and an additive quantile regression.

p <- ggplot(data = gapminder,
            mapping = aes(x = log(gdpPercap), y = lifeExp))

p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
    geom_smooth(color = "steelblue", fill="steelblue", method = "lm")

p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", method = "lm", size = 1.2, 
                formula = y ~ splines::bs(x, 3), se = FALSE)

p + geom_point(alpha=0.1) +
    geom_quantile(color = "tomato", size = 1.2, method = "rqss",
                  lambda = 1, quantiles = c(0.20, 0.5, 0.85))

Histograms, density plots, boxplots, and other geoms compute either single numbers or new variables before plotting them. As we saw in Section 4.4, these calculations are done by stat_ functions, each of which works hand-in-hand with its default geom_ function, and vice versa. Moreover, from the smoothing lines we drew from almost the very first plots we made, we have seen that stat_ functions can do a fair amount of calculation and even model estimation on the fly. The range of models that ggplot can directly fit to data is quite wide. The geom_smooth() function can take a range of method arguments to fit LOESS, OLS, and robust regression lines.

Both the geom_smooth() and geom_quantile() functions can also be instructed to use different formulas to produce their fits. In the top panel of Figure 6.1, we access the MASS library’s rlm function to fit a robust regression line. In the second panel, the bs function is invoked directly from the splines library in the same way, to fit a b-spline polynomial to the data. This is the same approach to directly accessing functions without loading a whole library that we have already used several times when using functions from the scales library. The geom_quantile() function, meanwhile, is like a specialized version of geom_smooth() that can fit quantile regression lines using a variety of methods. The quantiles argument takes a vector specifying the quantiles at which to fit the lines.

6.1 Show Several Fits at Once, with a Legend

As we just saw in the first panel of Figure 6.1, we can look at several fits at once on the same plot simply by layering on new smoothers with geom_smooth(). As long as we set the color and fill aesthetics to different values for each fit, we can easily distinguish them visually. However, ggplot will not draw a legend that guides us about which fit is which. This is because the smoothers are not logically connected to one another—they simply exist as separate layers. What if we are comparing several different fits and want a legend describing them?

As it turns out, geom_smooth() is clever enough to do this via the slightly unusual route of mapping the color and fill aesthetics to a string describing the model we are fitting, and then using scale_color_manual() and scale_fill_manual() to create the legend. First we use the RColorBrewer library’s brewer.pal() function to extract three qualitatively different colors from a larger palette, for us to use manually. The colors are represented as hex values. We use the :: convention (as we have done with labellers from the scales library) to use the function without loading the whole library:

model_colors <- RColorBrewer::brewer.pal(3, "Set1")

## [1] "#E41A1C" "#377EB8" "#4DAF4A"

Then we create a plot with three different smoothers, mapping the color and fill within the aes() function as the name of the smoother:

p0 <- ggplot(data = gapminder,
            mapping = aes(x = log(gdpPercap), y = lifeExp))

p1 <- p0 + geom_point(alpha = 0.2) +
    geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS")) +
    geom_smooth(method = "lm", formula = y ~ splines::bs(x, df = 3),
                aes(color = "Cubic Spline", fill = "Cubic Spline")) +
    geom_smooth(method = "loess",
                aes(color = "LOESS", fill = "LOESS"))

p1 + scale_color_manual(name = "Models", values = model_colors) +
    scale_fill_manual(name = "Models", values = model_colors) +
    theme(legend.position = "top")

Figure 6.2: Fitting smoothers with a legend.

Fitting smoothers with a legend.

In effect, we have cheated a little here to make the plot work. Until now, we have always mapped aesthetics to the names of variables, not to strings like “OLS” or “Cubic Splines”. In Chapter 3, when we discussed mapping versus setting aesthetics, we saw what happened when we tried to change the color of the points in a scatterplot by setting them to “purple” inside the aes() function. The result was that the points turned red instead, as ggplot in effect created a new variable and labeled it with the word “purple”. We learned there that the aes() function was for mapping variables to aesthetics.

Here we take advantage of that behavior, creating a new single-value variable for the name of each of our models. Ggplot is clever enough to properly construct the relevant guide if we call scale_color_manual() and scale_fill_manual().Remember that we have to call two scale functions because we have two mappings. The result is a single plot containing not just our three smoothers, but also an appropriate legend to guide the reader.

These model-fitting features make ggplot very useful for exploratory work, and make it straightforward to generate and compare model-based trends and other summaries as part of the process of descriptive data visualization. The various stat_ functions are a flexible way to add summary estimates of various kinds to plots. But we will also want more than this, including presenting results from models we fit ourselves.

6.2 Look Inside Model Objects

Covering the details of modeling data in R is beyond the scope of this book. But we will discuss some ways to take the models that you fit and extract information that is easy to work with in ggplot. We can start by learning a little more about how the output of models is stored in R. Remember, we are always working with objects, and objects have an internal structure consisting of named pieces. Sometimes these are single numbers, sometimes vectors, and sometimes lists of things like vectors, matrices, or formulas.

We have been working extensively with data frames. These store a table of data with named columns, perhaps consisting of different classes of variable, such as integers, characters, or factors. The tidyverse’s tibble class is much like a data frame, but has stronger views about how certain types of data should be stored. Model objects are a little more complicated again.


## # A tibble: 1,704 x 6
##        country continent  year lifeExp      pop gdpPercap
##         <fctr>    <fctr> <int>   <dbl>    <int>     <dbl>
##  1 Afghanistan      Asia  1952    28.8  8425333       779
##  2 Afghanistan      Asia  1957    30.3  9240934       821
##  3 Afghanistan      Asia  1962    32.0 10267083       853
##  4 Afghanistan      Asia  1967    34.0 11537966       836
##  5 Afghanistan      Asia  1972    36.1 13079460       740
##  6 Afghanistan      Asia  1977    38.4 14880372       786
##  7 Afghanistan      Asia  1982    39.9 12881816       978
##  8 Afghanistan      Asia  1987    40.8 13867957       852
##  9 Afghanistan      Asia  1992    41.7 16317921       649
## 10 Afghanistan      Asia  1997    41.8 22227415       635
## # ... with 1,694 more rows

Recall from Section 1.4.8 that we can use the str() function to learn more about the internal structure of any object. For example, we can get some information on what class (or classes) of object gapminder is, how large it is, and what components it has. The output from str(gapminder) is somewhat dense:

## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3
##    ...
## $ year : int 1952 1957 ...
## $ lifeExp : num 28.8 ...
## $ pop : int 8425333 9240934 ...
## $ gdpPercap: num 779 ...

There is a lot of information here about the object as a whole and each variable in it. In the same way, statistical models in R have an internal structure. But because models are more complex entities than data tables, their structure is correspondingly more complicated. There are more pieces of information, and more kinds of information, that we might want to use. All of this information is generally stored in or is computable from parts of a model object.

We can create a linear model, an ordinary OLS regression, using the gapminder data. This dataset has a country-year structure that makes an OLS specification like this the wrong one to use. But never mind that for now. We use the lm() function to run the model, and store it in an object called out:

out <- lm(formula = lifeExp ~ gdpPercap + pop + continent,
          data = gapminder)

Schematic view of a linear model object. Figure 6.3: Schematic view of a linear model object.

The first argument is the formula for the model. lifeExp is the dependent variable and the tilde ~ operator is used to designate the left- and right-hand sides of a model (including in cases, as we saw with facet_wrap() where the model just has a right-hand side.)

Let’s look at the results by asking R to print a summary of the model.


## Call:
## lm(formula = lifeExp ~ gdpPercap + pop + continent, data = gapminder)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.16  -4.49   0.30   5.11  25.17 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.78e+01   3.40e-01  140.82   <2e-16 ***
## gdpPercap         4.50e-04   2.35e-05   19.16   <2e-16 ***
## pop               6.57e-09   1.98e-09    3.33    9e-04 ***
## continentAmericas 1.35e+01   6.00e-01   22.46   <2e-16 ***
## continentAsia     8.19e+00   5.71e-01   14.34   <2e-16 ***
## continentEurope   1.75e+01   6.25e-01   27.97   <2e-16 ***
## continentOceania  1.81e+01   1.78e+00   10.15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 8.37 on 1697 degrees of freedom
## Multiple R-squared:  0.582,  Adjusted R-squared:  0.581 
## F-statistic:  394 on 6 and 1697 DF,  p-value: <2e-16

When we use the summary() function on out, we are not getting a simple feed of what’s in the model object. Instead, like any function, summary() takes its input, performs some actions, and produces output. In this case, what is printed to the console is partly information that is stored inside the model object, and partly information that the summary() function has calculated and formated for display on the screen. Behind the scenes, summary() gets help from other functions. Objects of different classes have default methods associated with them, so that when the generic summary() function is applied to a linear model object, the function knows to pass the work on to a more specialized function that does a bunch of calculations and formatting appropriate to a linear model object. We use the same generic summary() function on data frames, as in summary(gapminder), but in that case a different default method is applied.

The output from summary() gives a precis of the model, but we can’t really do any further analysis with it directly. For example, what if we want to plot something from the model? The information necessary to make plots is inside the out object, but it is not obvious how to use it. If we take a look at the structure of the model object we can see that there is a lot of information in there. It is organized as a list of components or elements. Several of these elements are themselves lists. Figure 6.3 gives you a schematic view of the contents of a linear model object. As you can see, it is a list of items, some of which are single values, some nested lists of values, and some data frames. Again, remember our discussion in Section 1.4.8, where we said you can think of objects as being organized like a filing system: cabinets contain folders, any of which may contain pages of information, or even whole documents. Alternatively, sticking with the list analogy, think of a master to-do list for a project, where your to-do list might have items that themselves contain additional lists of tasks.

The out object created by lm contains several different kinds of named element. Some, like the residual degrees of freedom in the model, are just a single number.Try out$df.residual at the console. Others are much larger entities, such as the data frame used to fit the model, which is retained by default. Try out$model, but be prepared for a lot of stuff to be printed at the console. Other elements have been computed by R and then stored, such as the coefficients of the model and other quantities—try out$coefficients, out$residuals, and out$fitted.values, for instance. Others are lists themselves (like qr). So you can see that the summary() function is selecting and printing only a small amount of core information, in comparison to what is stored in the model object.

Just like the tables of data we saw earlier in Section A.2, the output of summary() is presented in a way that is compact and efficient in terms of getting information across, but also untidy when considered from the point of view of further manipulation. There is a table of coefficients, but the variable names are in the rows. The column names are awkward, and some information (e.g. at the bottom of the output) has been calculated and printed out, but is not stored in the model object.

6.3 Get Model-based Graphics Right

Figures based on statistical models face all the ordinary challenges of effective data visualization, and then some. This is because model results usually carry a considerable extra burden of interpretation and necessary background knowledge. The more complex the model, the trickier it becomes to convey this information effectively, and the easier it becomes to lead one’s audience or oneself into error. Within the social sciences, our ability to clearly and honestly present model-based graphics has greatly improved over the past ten or fifteen years. Over the same period, there has also been a much sharper understanding of just how tricky some models are to understand, including ones—such as interaction effects, logistic regression, and especially the combination of the two—that had previously been seen as a straightforward element in the modeling toolkit (Ai and Norton 2003; Brambor, Clark, and Golder 2006).

Plotting model estimates is closely connected to properly estimating models in the first place. This means there is no substitute for learning the statistics. You should not use graphical methods as a substitute for understanding the model used to produce them. While this book cannot teach you that material, we can make a few general points about what good model-based graphics look like, and work through some examples of how ggplot and some additional libraries can make it easier to get good results.

6.3.1 Present your findings in substantive terms

Useful model-based plots show results in ways that are substantively meaningful and directly interpretable with respect to the questions the analysis is trying to answer. Most simply, this means showing results in a context where other variables in the analysis are held at sensible values, such as their means or medians. With continuous variables, it can often be useful to generate predicted values that cover some substantively meaningful move across the distribution, such as from the 25th to the 75th percentile. For categorical variables, predicted values might be presented with respect to the modal category in the data, or for a particular category of theoretical interest. Relatedly, presenting substantively interpretable findings often also means using (and sometimes converting to) a scale that readers can easily understand. If your model reports results in log-odds, for example, maybe converting to predicted probabilities will make it easier to see the meaning of the results. Note that each of these points would apply equally well to the presentation of summary results in a table rather than a graph. There is nothing distinctively graphical about putting the focus on the substantive meaning of your findings.

6.3.2 Show your degree of confidence

Much the same applies to presenting the degree of uncertainty or confidence you have in your results. Model estimates come with various measures of precision, confidence, credence, or significance. Presenting and interpreting these measures is notoriously prone to misinterpretation, or over-interpretation, as researchers and audiences both demand more from things like confidence intervals and p-values than these statistics can deliver. At a minimum, having decided on an appropriate measure of model fit or the right assessment of confidence, you should show their range when you present your results. A family of related ggplot geoms allow you to show a range or interval defined by position on the x-axis and then a ymin and ymax range on the y-axis. These geoms include geom_pointrange() and geom_errorbar(), which we will see in action shortly. A related geom, geom_ribbon() uses the same arguments to draw filled areas, and is useful for plotting ranges of y-axis values along some continuously varying x-axis.

6.3.3 Show your data when you can

Plotting the results from a multivariate model generally means one of two things. First, we can show what is in effect a table of coefficients with associated measures of confidence, perhaps organizing the coefficients into meaningful groups, or by the size of the predicted association, or both. Second, we can show the predicted values of some variables (rather than just a model’s coefficients) across some range of interest. The latter approach lets us show the original data points if we wish. The way ggplot builds graphics layer by layer allows us to easily combine model estimates (e.g. a regression line and an associated range) and the underlying data. In effect these are manually-constructed versions of the automatically-generated plots that we have been producing with geom_smooth() since the beginning of this book.

6.4 Generate Predictions to Graph

Having fitted a model, then, we might want to get a picture of the estimates it produces over the range of some particular variable, holding other covariates constant at some sensible values. The predict() function is a generic way of using model objects to produce this kind of prediction. In R, “generic” functions take their inputs and pass them along to more specific functions behind the scenes, ones that are suited to working with the particular kind of model object we have. The details of getting predicted values from a OLS model, for instance, will be somewhat different from getting predictions out of a logistic regression. But in each case we can use the same predict() function, taking care to check the documentation to see what form the results are returned in for the kind of model we are working with. Many of the most commonly-used functions in R are generic in this way. The summary() function, for example, works on objects of many different classes, from vectors to data frames and statistical models, producing appropriate output in each case by way of a class-specific function in the background.

For predict() to calculate the new values for us, it needs some new data to fit the model to. We will generate a new data frame whose columns have the same names as the variables in the model’s original data, but where the rows have new values. A very useful function called expand.grid() will help us do this. We will give it a list of variables, specifying the range of values we want each variable to take. Then expand.grid() will generate the will multiply out theThe function calculates the cartesian product of the variables given to it. full range of values for all combinations of the values we give it, thus creating a new data frame with the new data we need.

In the following bit of code, we use min() and max() to get the minimum and maximum values for per capita GDP, and then create a vector with one hundred evenly-spaced elements between the minimum and the maximum. We hold population constant at its median, and we let continent take all of its five available values.

min_gdp <- min(gapminder$gdpPercap)
max_gdp <- max(gapminder$gdpPercap)

med_pop <- median(gapminder$pop)

pred_df <- expand.grid(gdpPercap = (seq(from = min_gdp,
                                        to = max_gdp,
                                        length.out = 100)),
                       pop = med_pop,
                       continent = c("Africa", "Americas",
                                     "Asia", "Europe", "Oceania"))


## [1] 500   3


##   gdpPercap     pop continent
## 1   241.166 7023596    Africa
## 2  1385.428 7023596    Africa
## 3  2529.690 7023596    Africa
## 4  3673.953 7023596    Africa
## 5  4818.215 7023596    Africa
## 6  5962.477 7023596    Africa

Now we can use predict(). If we give the function our new data and model, without any further argument, it will calculate the fitted values for every row in the data frame. If we specify interval = 'predict' as an argument, it will calculate 95% prediction intervals in addition to the point estimate.

pred_out <- predict(object = out,
                    newdata = pred_df,
                    interval = "predict")


##       fit     lwr     upr
## 1 47.9686 31.5477 64.3895
## 2 48.4830 32.0623 64.9037
## 3 48.9973 32.5767 65.4180
## 4 49.5117 33.0909 65.9325
## 5 50.0260 33.6050 66.4471
## 6 50.5404 34.1189 66.9619

Because we know that, by construction, the cases in pred_df and pred_out correspond row for row, we can simply bind the two data frames together by column. Note that this method of joining or merging tables is definitely not recommended when you are dealing with data.

pred_df <- cbind(pred_df, pred_out)


##   gdpPercap     pop continent  fit  lwr  upr
## 1       241 7023596    Africa 48.0 31.5 64.4
## 2      1385 7023596    Africa 48.5 32.1 64.9
## 3      2530 7023596    Africa 49.0 32.6 65.4
## 4      3674 7023596    Africa 49.5 33.1 65.9
## 5      4818 7023596    Africa 50.0 33.6 66.4
## 6      5962 7023596    Africa 50.5 34.1 67.0

The end result is a tidy data frame, containing the predicted values from the model for the range of values we specified. Now we can plot the results. Because we produced a full range of predicted values, we can decide whether or not to use all of them. Here we further subset the predictions to just those for Europe and Africa.

OLS Predictions. Figure 6.4: OLS Predictions.

p <- ggplot(data = subset(pred_df, continent %in% c("Europe", "Africa")),
            aes(x = gdpPercap,
                y = fit,
                ymin = lwr,
                ymax = upr,
                color = continent,
                fill = continent,
                group = continent))

p + geom_point(data = subset(gapminder,
                             continent %in% c("Europe", "Africa")),
               aes(x = gdpPercap,
                   y = lifeExp,
                   color = continent),
               alpha = 0.5,
               inherit.aes = FALSE) + 
    geom_line() +
    geom_ribbon(alpha = 0.2, color = FALSE) +
    scale_x_log10(labels = scales::dollar)

We use a new geom here to draw the area covered by the prediction intervals: geom_ribbon(). It takes an x argument like a line, but a ymin and ymax argument as specified in the ggplot() aesthetic mapping. This defines the lower and upper limits of the prediction interval.

In practice, you may not use predict() directly all that often. Instead, you might write code using additional libraries that encapsulate the process of producing predictions and plots from models. These are especially useful when your model is a little more complex and the interpretation of coefficients becomes trickier. This happens, for instance, when you have a binary outcome variable and need to convert the results of a logistic regression into predicted probabilities, or when you have interaction terms amongst your predictions. We will discuss some of these helper libraries in the next few sections. However, bear in mind that predict() and its ability to work with different classes of model underpins many of those libraries. So it’s useful to see it in action first hand in order to understand what it is doing.

6.5 Tidy Model Objects with Broom

The predict method is very useful, but there are a lot of other things we might want to do with our model output. We will use David Robinson’s broom package to help us out. It is a library of functions that help us get from the model results that R generates to numbers that we can plot. It will take model objects and turn pieces of them into data frames that you can use easily with ggplot.


Broom takes ggplot’s approach to tidy data and extends it to the model objects that R produces. Its methods can tidily extract three kinds of information. First, we can see component-level information about aspects of the model itself, such as coefficients and t-statistics. Second, we can obtain observation-level information about the model’s connection to the underlying data. This includes the fitted values and residuals for each observation in the data. And finally we can get model-level information that summarizes the fit as a whole, such as an F-statistic, the model deviance, or the r-squared. There is a broom function for each of these tasks.

6.5.1 Get component-level statistics with tidy()

The tidy() function takes a model object and returns a data frame of component-level information. We can work with this to make plots in a familiar way, and much more easily than fishing inside the model object to extract the various terms. Here is a basic example, using the default results as just returned. For a more convenient display of the results, we will pipe the object we create with tidy() through a function that rounds the numeric columns of the data frame to two decimal places. This doesn’t change anything about the object itself, of course.

out_comp <- tidy(out)

out_comp %>% round_df()

##                term estimate std.error statistic p.value
## 1       (Intercept)    47.81      0.34    140.82       0
## 2         gdpPercap     0.00      0.00     19.16       0
## 3               pop     0.00      0.00      3.33       0
## 4 continentAmericas    13.48      0.60     22.46       0
## 5     continentAsia     8.19      0.57     14.34       0
## 6   continentEurope    17.47      0.62     27.97       0
## 7  continentOceania    18.08      1.78     10.15       0

We are now able to treat this data frame just like all the other data that we have seen so far.

Basic plot of OLS estimates. Figure 6.5: Basic plot of OLS estimates.

p <- ggplot(out_comp, mapping = aes(x = term,
                                    y = estimate))

p + geom_point() + coord_flip() 

We can extend and clean up this plot in a variety of ways. For example, we can tell tidy() to calculate confidence intervals for the estimates, using R’s confint() function.

out_conf <- tidy(out, conf.int = TRUE)
out_conf %>% round_df()

##                term estimate std.error statistic p.value conf.low conf.high
## 1       (Intercept)    47.81      0.34    140.82       0    47.15     48.48
## 2         gdpPercap     0.00      0.00     19.16       0     0.00      0.00
## 3               pop     0.00      0.00      3.33       0     0.00      0.00
## 4 continentAmericas    13.48      0.60     22.46       0    12.30     14.65
## 5     continentAsia     8.19      0.57     14.34       0     7.07      9.31
## 6   continentEurope    17.47      0.62     27.97       0    16.25     18.70
## 7  continentOceania    18.08      1.78     10.15       0    14.59     21.58

The convenience “not in” operator %nin% is available via the socviz library. It does the opposite of %in% and selects only the items in a first vector of characters that are not in the second. We’ll use it to drop the intercept term from the table. We also want to something about the labels. When fitting a model with categorical variables, R will create coefficient names based on the variable name and the category name, like continentAmericas. Normally we like to clean these up before plotting. Most commonly, we just want to strip away the variable name at the beginning of the coefficient label. For this we can use prefix_strip(), a convenience function in the socviz library. We tell it which prefixes to drop, using it to create a new column variable in out_conf that corresponds to the terms column, but that has nicer labels.

out_conf <- subset(out_conf, term %nin% "(Intercept)")
out_conf$nicelabs <- prefix_strip(out_conf$term, "continent")

Now we can use geom_pointrange() to make a figure that displays some information about our confidence in the variable estimates, as opposed to just the coefficients. As with the boxplots earlier, we use reorder() to sort the names of the model’s terms by the estimate variable, thus arranging our plot of effects from largest to smallest in magnitude.

A nicer plot of OLS estimates and confidence intervals. Figure 6.6: A nicer plot of OLS estimates and confidence intervals.

p <- ggplot(out_conf, mapping = aes(x = reorder(nicelabs, estimate),
                                    y = estimate, ymin = conf.low, ymax = conf.high))

p + geom_pointrange() + coord_flip() + labs(x="", y="OLS Estimate")

Notice also that dotplots of this kind can be very compact. The vertical axis can often be compressed quite a bit, with no loss in comprehension. In fact, they are often easier to read with much less room between the rows than given by a default square shape.

6.5.2 Get observation-level statistics with augment()

The values returned by augment() are all statistics calculated at the level of the original observations. As such, they can be added on to the data frame that the model is based on. Working from a call to augment() will return a data frame with all the original observations used in the estimation of the model, together with columns like the following:

  • .fitted — The fitted values of the model.
  • .se.fit — The standard errors of the fitted values.
  • .resid — The residuals.
  • .hat — The diagonal of the hat matrix.
  • .sigma — An estimate of residual standard deviation when the corresponding observation is dropped from the model.
  • .cooksd — Cook’s distance, a common regression diagnostic; and
  • .std.resid — The standardized residuals.

Each of these variables is named with a leading dot, for example .hat rather than hat, and so on. This is to guard against accidentally confusing it with (or accidentally overwriting) an existing variable in your data with this name. The columns of values return will differ slightly depending on the class of model being fitted.

out_aug <- augment(out)
head(out_aug) %>% round_df()

##   lifeExp gdpPercap      pop continent .fitted .se.fit
## 1   28.80    779.45  8425333      Asia   56.41    0.47
## 2   30.33    820.85  9240934      Asia   56.44    0.47
## 3   32.00    853.10 10267083      Asia   56.46    0.47
## 4   34.02    836.20 11537966      Asia   56.46    0.47
## 5   36.09    739.98 13079460      Asia   56.43    0.47
## 6   38.44    786.11 14880372      Asia   56.46    0.47
##   .resid .hat .sigma .cooksd .std.resid
## 1 -27.61    0   8.34    0.01      -3.31
## 2 -26.10    0   8.34    0.00      -3.13
## 3 -24.46    0   8.35    0.00      -2.93
## 4 -22.44    0   8.35    0.00      -2.69
## 5 -20.34    0   8.35    0.00      -2.44
## 6 -18.02    0   8.36    0.00      -2.16

By default, augment() will extract the available data from the model object. This will usually include the variables used in the model itself, but not any additional ones contained in the original data frame. Sometimes it is useful to have these. We can add them by specifying the data argument:

out_aug <- augment(out, data = gapminder)
head(out_aug) %>% round_df()

##       country continent year lifeExp      pop gdpPercap
## 1 Afghanistan      Asia 1952   28.80  8425333    779.45
## 2 Afghanistan      Asia 1957   30.33  9240934    820.85
## 3 Afghanistan      Asia 1962   32.00 10267083    853.10
## 4 Afghanistan      Asia 1967   34.02 11537966    836.20
## 5 Afghanistan      Asia 1972   36.09 13079460    739.98
## 6 Afghanistan      Asia 1977   38.44 14880372    786.11
##   .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
## 1   56.41    0.47 -27.61    0   8.34    0.01      -3.31
## 2   56.44    0.47 -26.10    0   8.34    0.00      -3.13
## 3   56.46    0.47 -24.46    0   8.35    0.00      -2.93
## 4   56.46    0.47 -22.44    0   8.35    0.00      -2.69
## 5   56.43    0.47 -20.34    0   8.35    0.00      -2.44
## 6   56.46    0.47 -18.02    0   8.36    0.00      -2.16

Note that if cases had to be dropped from the original data set due to missing data on some variables, these will not be carried over to the augmented data frame.

The new columns created by augment() can be used to create some standard regression plots. For example, we can plot the fitted values versus the residuals. Figure 6.7 suggests, unsurprisingly, that our country-year data has rather more structure than is captured by our OLS model.

Fitted values vs Residuals. Figure 6.7: Fitted values vs Residuals.

p <- ggplot(data = out_aug,
            mapping = aes(x = .fitted, y = .resid))
p + geom_point()

6.5.3 Get model-level statistics with glance()

This function organizes the information typically presented at the bottom of a model’s summary() output. By itself, it usually just returns a table with a single row in it. But as we shall see in a moment, the real power of brooom’s approach is the way that it can scale up to cases where we are grouping or subsampling our data.

glance(out) %>% round_df()

##   r.squared adj.r.squared sigma statistic p.value df
## 1      0.58          0.58  8.37    393.91       0  7
##     logLik     AIC     BIC deviance df.residual
## 1 -6033.83 12083.6 12127.2   118754        1697

Broom is able to tidy (and augment, and glance at) a wide range of model types. Not all functions are available for all classes of model—you can consult broom’s documentation for more details on what is available. For example, here is a plot created from the tidied output of an event-history analysis. First we generate a Cox proportional hazards model of some survival data.


out_cph <- coxph(Surv(time, status) ~ age + sex, data = lung)
out_surv <- survfit(out_cph)

The details of the fit are not important here, but in the first step the Surv() function creates the response or outcome variable for the proportional hazards model that is then fitted by the coxph() function. Then the survfit() function creates the survival curve from the model, much like we used predict() to generate predicted values earlier. Try summary(out_cph) to see the model, and summary(out_surv) to see the table of predicted values that will form the basis for our plot. Next we tidy out_surv to get a data frame, and plot it.

A Kaplan-Meier plot. Figure 6.8: A Kaplan-Meier plot.

out_tidy <- tidy(out_surv)

p <- ggplot(data = out_tidy, mapping = aes(time, estimate))

p + geom_line() +
    geom_ribbon(mapping = aes(ymin = conf.low, ymax = conf.high), alpha = .2)

6.6 Grouped Analysis with Broom

Broom makes it possible to quickly fit models to different subsets of your data and get consistent and usable tables of results out the other end. For example, let’s say we wanted to look at the gapminder data by examining the relationship between life expectancy and GDP by continent, for each year in the data.

The gapminder data is at bottom organized by country-years—that is the basic unit of observation in the rows. If we wanted, we could take a slice of the data manually, such as “all countries observed in Asia, in 1962” or “all in Africa, 2002”. Here is “Europe, 1977”:

eu77 <- gapminder %>% filter(continent == "Europe", year == 1977)


## # A tibble: 30 x 6
##                   country continent  year lifeExp      pop
##                    <fctr>    <fctr> <int>   <dbl>    <int>
##  1                Albania    Europe  1977   68.93  2509048
##  2                Austria    Europe  1977   72.17  7568430
##  3                Belgium    Europe  1977   72.80  9821800
##  4 Bosnia and Herzegovina    Europe  1977   69.86  4086000
##  5               Bulgaria    Europe  1977   70.81  8797022
##  6                Croatia    Europe  1977   70.64  4318673
##  7         Czech Republic    Europe  1977   70.71 10161915
##  8                Denmark    Europe  1977   74.69  5088419
##  9                Finland    Europe  1977   72.52  4738902
## 10                 France    Europe  1977   73.83 53165019
## # ... with 20 more rows, and 1 more variables:
## #   gdpPercap <dbl>

We could then see what the relationship between life expectancy and GDP looked like for that continent-year group:

fit <- lm(lifeExp ~ log(gdpPercap), data = eu77)


## Call:
## lm(formula = lifeExp ~ log(gdpPercap), data = eu77)
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.496 -1.031  0.093  1.176  3.712 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      29.489      7.161    4.12  0.00031 ***
## log(gdpPercap)    4.488      0.756    5.94  2.2e-06 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 2.11 on 28 degrees of freedom
## Multiple R-squared:  0.557,  Adjusted R-squared:  0.541 
## F-statistic: 35.2 on 1 and 28 DF,  p-value: 2.17e-06

With dplyr and broom we can do this for every continent-year slice of the data in a compact and tidy way. We start our table of data, then (%>%) group the countries by continent and year using the group_by() function. The call group_by(continent, year) reorganizes the data. Previously, the country-year observations were organized alphabetically by country, and within country by year. Now the table is reshuffled so that the observations are organized first by continent, and within continent by year.

With the data grouped in this way, we pass it along the pipeline and fit some models. We create the fit object inside the pipeline. The do() function takes care of that again, instructing R to create the fit model for every instance of the continent-year grouping.

out_le <- gapminder %>% group_by(continent, year) %>%
    do(fit = lm(lifeExp ~ gdpPercap, data = .))


## Source: local data frame [60 x 3]
## Groups: <by row>
## # A tibble: 60 x 3
##    continent  year      fit
##  *    <fctr> <int>   <list>
##  1    Africa  1952 <S3: lm>
##  2    Africa  1957 <S3: lm>
##  3    Africa  1962 <S3: lm>
##  4    Africa  1967 <S3: lm>
##  5    Africa  1972 <S3: lm>
##  6    Africa  1977 <S3: lm>
##  7    Africa  1982 <S3: lm>
##  8    Africa  1987 <S3: lm>
##  9    Africa  1992 <S3: lm>
## 10    Africa  1997 <S3: lm>
## # ... with 50 more rows

In this pipeline, the “.” mark is shorthand for “the thing we are computing on just now”, or “the thing we inherited from the operation that just finished on the left-hand-side of the pipeline”. Most often this is the data in some form—in this case, the subset we are fitting the model to.

The resulting object is has the tabular form we expect—it is a tibble—but it looks a little unusual. The first two columns are the familiar continent and year. As you can see, we have a row for each year in the data, and the years are grouped within continents. The third column, fit, is not like the others. It is represented as a row in the table, but it is in fact an lm object—that is, the full output of a linear regression. This way of storing output is called a list column. What we have done is fit sixty OLS models, one for every group of countries within each continent-year grouping. Our “Europe 1977” fit is in there:

out_le %>% filter(continent == "Europe" & year == 1977) 

## Source: local data frame [1 x 3]
## Groups: <by row>
## # A tibble: 1 x 3
##   continent  year      fit
##      <fctr> <int>   <list>
## 1    Europe  1977 <S3: lm>

We will not work with the contents of the list column directly. It is more like an intermediate storage container. Instead, we tidy the output some more. Again, we use a series of pipes to do it. Here, start with our out_le object and then tidy every fit in it, which works in just the same way that tidying a single fit would work. We filter out all the Intercept terms and also all the observations from Oceania. In the case of the Intercepts we do this just out of convenience. In Oceania’s case there are very few observations from that continent. We put the results in an object called out_tidy.

out_tidy <- out_le %>% tidy(fit) %>%
    filter(term %nin% "(Intercept)" &
           continent %nin% "Oceania")

We now have tidy regression output with an estimate of GDP per capita for each year, within continents. Now we can plot these estimates in a way that takes advantage of their groupiness.

p <- ggplot(data = out_tidy,
            mapping = aes(x = year, y = estimate,
                          ymin = estimate - 2*std.error,
                          ymax = estimate + 2*std.error,
                          group = continent, color = continent))

p + geom_pointrange(position = position_dodge(width = 1)) +
    scale_x_continuous(breaks = unique(gapminder$year)) + 
    theme(legend.position = "top") +
    labs(x = "Year", y = "Estimate", color = "Continent")

Figure 6.9: Yearly estimates of the association between GDP and Life Expectancy, pooled by continent.

Yearly estimates of the association between GDP and Life Expectancy, pooled by continent.

Note the use of the position_dodge() function here to set the position of the pointranges. We could have faceted the results by continent, but doing it this way lets us see the yearly estimates right next to each other. This technique is very useful not just for cases like this, but also when you want to compare the coefficients given by different kinds of statistical model. This sometimes happens when we’re interested in seeing how, say, OLS performs against some fancier specification. The position_dodge() option lets you place the point-estimates side by side.

6.6.1 Broom’s Approach Scales Nicely

When you think of output from tidy(), augment(), or glance() as forming a data frame with consistent column names, a natural extension is to exercises in repeated measures. Grouped analysis is one way to go, as we have just seen. Another application is to plots that effectively summarize, for example, a bootstrapping exercise with many replicates. Here we follow and partly adapt an example from David Robinson’s discussion in broom’s documentation, illustrating the sort of thing that you can do once we take advantage of broom’s ability to streamline the management of a lot of model output within a pipeline. The following chunk of code fits a non-linear least squares model to life expectancy regressed on per capita GDP.

out_nls <- nls(lifeExp ~ k / log(gdpPercap) + b,
               data = gapminder, start=list(k=1, b=0))

## Formula: lifeExp ~ k/log(gdpPercap) + b
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## k  -533.48       9.60   -55.5   <2e-16 ***
## b   126.42       1.22   103.7   <2e-16 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Residual standard error: 7.7 on 1702 degrees of freedom
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 3.44e-09

A nonlinear least squares fit of life expectancy to per capita gdp. Figure 6.10: A nonlinear least squares fit of life expectancy to per capita gdp.

p <- ggplot(data = gapminder, mapping = aes(x = log(gdpPercap), y = lifeExp))

p + geom_point(alpha=0.1) + geom_line(aes(y=predict(out_nls)))

Next, we can do the same again, but this time generate a hundred bootstrap replicates of the data, and fit the model to each one. We then use tidy() to summarize the coefficients from each model.

out_bootnls <- gapminder %>% bootstrap(100) %>%
    do(tidy(nls(lifeExp ~ k / log(gdpPercap) + b, .,
                start = list(k = 1, b = 0))))


## # A tibble: 200 x 6
## # Groups:   replicate [100]
##    replicate  term estimate std.error statistic p.value
##        <int> <chr>    <dbl>     <dbl>     <dbl>   <dbl>
##  1         1     k -531.317   9.51458  -55.8424       0
##  2         1     b  126.535   1.19983  105.4612       0
##  3         2     k -543.791   9.12949  -59.5643       0
##  4         2     b  127.740   1.16461  109.6848       0
##  5         3     k -557.662   9.88955  -56.3891       0
##  6         3     b  128.741   1.25204  102.8248       0
##  7         4     k -536.640   9.66840  -55.5045       0
##  8         4     b  126.991   1.22768  103.4396       0
##  9         5     k -548.285   9.40235  -58.3135       0
## 10         5     b  128.287   1.19643  107.2255       0
## # ... with 190 more rows

We can now look at the distribution of the model’s b and k parameters from the bootstrapped dataset.

Histograms of the bootstrap estimates of the b and k parameters. Figure 6.11: Histograms of the bootstrap estimates of the b and k parameters.

p <- ggplot(out_bootnls, mapping = aes(estimate))
p + geom_histogram(binwidth = 2) + facet_wrap(~ term, scales = "free")

We can also make a plot showing the fitted line from each of the bootstrap replicates. This time we use augment() instead of tidy() because we want observation-level information at the end of the process.

out_bootnls_aug <- gapminder %>% bootstrap(100) %>%
    do(augment(nls(lifeExp ~ k / log(gdpPercap) + b, .,
                   start=list(k=1, b=0)), .))


## # A tibble: 170,400 x 9
## # Groups:   replicate [100]
##    replicate           country continent  year lifeExp
##        <int>            <fctr>    <fctr> <int>   <dbl>
##  1         1        Costa Rica  Americas  1997  77.260
##  2         1            Zambia    Africa  2002  39.193
##  3         1             Haiti  Americas  1967  46.243
##  4         1             Kenya    Africa  1982  58.766
##  5         1           Denmark    Europe  1957  71.810
##  6         1 Equatorial Guinea    Africa  1957  35.983
##  7         1          Botswana    Africa  1992  62.745
##  8         1            Israel      Asia  1972  71.630
##  9         1      Saudi Arabia      Asia  1962  45.914
## 10         1           Reunion    Africa  1982  69.885
## # ... with 170,390 more rows, and 4 more variables:
## #   pop <int>, gdpPercap <dbl>, .fitted <dbl>, .resid <dbl>

Plotting 100 bootstrapped NLS fits. Figure 6.12: Plotting 100 bootstrapped NLS fits.

p <-  ggplot(data = gapminder,
             mapping = aes(x = log(gdpPercap), y = lifeExp))

p + geom_point(alpha = 0.1) +
    geom_line(data = out_bootnls_aug,
              mapping = aes(y = .fitted, group = replicate),
              alpha = 0.1)

In the earlier examples we took the data, grouped it, and then fitted a model. Or we took the data, bootstrapped it, and then fitted models to each of the bootstrap replicates. As a final example, we can combine these operations, grouping the data first and then bootstrapping within our groups.

bootspline_aug <- gapminder %>% group_by(continent) %>%
    bootstrap(100, by_group = TRUE) %>%
    do(augment(nls(lifeExp ~ k / log(gdpPercap) + b, .,
                   start = list(k = 1, b = 0)), .))

p <-  ggplot(data = gapminder,
             mapping = aes(x = log(gdpPercap),
                           y = lifeExp, color = continent))

p + geom_point(alpha = 0.05, color = "black") +
    geom_line(data = bootspline_aug,
              mapping = aes(y = .fitted,
                            group = replicate), alpha = 0.05) +
    facet_wrap(~ continent, nrow = 1) +
    guides(fill = FALSE, color = FALSE)

The only addition we needed to make to the code was an explicit instruction to bootstrap() that we want the replicates to be drawn within continents. Hence the argument by_group = TRUE.

NLS model fit to 100 bootstrap replicates of the gapminder data by continent.

Figure 6.13: NLS model fit to 100 bootstrap replicates of the gapminder data by continent.

6.7 Plot Marginal Effects

Our earlier discussion of predict() was about obtaining estimates of the average effect of some coefficient, net of the other terms in the model. Over the past decade, estimating and plotting partial or marginal effects from a model has become an increasingly common way of presenting accurate and interpretively useful predictions. Interest in marginal effects plots was stimulated by the realization that the interpretation of terms in logistic regression models, in particular, was trickier than it seemed—especially when there were interaction terms in the model (Ai and Norton 2003). Thomas Leeper’s margins package can make these plots for us.


To see it in action, we’ll take another look at the General Social Survey data in gss_sm, this time focusing on the binary variable, obama. It is coded 1 if the respondent said they voted for Barack Obama in the 2012 presidential election, and 0 otherwise.As is common with retrospective questions on elections, rather more people claim to have voted for Obama than is consistent with the vote share he received in the election. In this case, mostly for convenience here, the zero code includes all other answers to the question, including those who said they voted for Mitt Romney, those who said they did not vote, those who refused to answer, and those who said they didn’t know who they voted for. We will fit a logistic regression on obama, with age, polviews, race, and sex as the predictors. The age variable is the respondent’s age in years. The sex variable is coded as “Male” or “Female” with “Male” as the reference category. The race variable is coded as “White”, “Black”, or “Other” with “White” as the reference category. The polviews measure is a self-reported scale of the respondent’s political orientation from “Extremely Conservative” through “Extremely Liberal”, with “Moderate” in the middle. We take polviews and create a new variable, polviews_m, using the relevel() function to recode “Moderate” to be the reference category. We fit the model with the glm() function, and specify an interaction between race and sex.

gss_sm$polviews_m <- relevel(gss_sm$polviews, ref = "Moderate")

out_bo <- glm(obama ~ polviews_m + sex*race,
              family = "binomial", data = gss_sm)


## Call:
## glm(formula = obama ~ polviews_m + sex * race, family = "binomial", 
##     data = gss_sm)
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.905  -0.554   0.177   0.542   2.244  
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                       0.29649    0.13409    2.21   0.0270 *  
## polviews_mExtremely Liberal       2.37295    0.52504    4.52  6.2e-06 ***
## polviews_mLiberal                 2.60003    0.35667    7.29  3.1e-13 ***
## polviews_mSlightly Liberal        1.29317    0.24843    5.21  1.9e-07 ***
## polviews_mSlightly Conservative  -1.35528    0.18129   -7.48  7.7e-14 ***
## polviews_mConservative           -2.34746    0.20038  -11.71  < 2e-16 ***
## polviews_mExtremely Conservative -2.72738    0.38721   -7.04  1.9e-12 ***
## sexFemale                         0.25487    0.14537    1.75   0.0796 .  
## raceBlack                         3.84953    0.50132    7.68  1.6e-14 ***
## raceOther                        -0.00214    0.43576    0.00   0.9961    
## sexFemale:raceBlack              -0.19751    0.66007   -0.30   0.7648    
## sexFemale:raceOther               1.57483    0.58766    2.68   0.0074 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## (Dispersion parameter for binomial family taken to be 1)
##     Null deviance: 2247.9  on 1697  degrees of freedom
## Residual deviance: 1345.9  on 1686  degrees of freedom
##   (1169 observations deleted due to missingness)
## AIC: 1370
## Number of Fisher Scoring iterations: 6

The summary reports the coefficients and other information. We can now graph the data in any one of several ways. Using margins() we calculate the marginal effects for each variable:

bo_m <- margins(out_bo)

##                            factor     AME     SE        z      p   lower   upper
##            polviews_mConservative -0.4119 0.0283 -14.5394 0.0000 -0.4674 -0.3564
##  polviews_mExtremely Conservative -0.4538 0.0420 -10.7971 0.0000 -0.5361 -0.3714
##       polviews_mExtremely Liberal  0.2681 0.0295   9.0996 0.0000  0.2103  0.3258
##                 polviews_mLiberal  0.2768 0.0229  12.0736 0.0000  0.2319  0.3218
##   polviews_mSlightly Conservative -0.2658 0.0330  -8.0596 0.0000 -0.3304 -0.2011
##        polviews_mSlightly Liberal  0.1933 0.0303   6.3896 0.0000  0.1340  0.2526
##                         raceBlack  0.4032 0.0173  23.3568 0.0000  0.3694  0.4371
##                         raceOther  0.1247 0.0386   3.2297 0.0012  0.0490  0.2005
##                         sexFemale  0.0443 0.0177   2.5073 0.0122  0.0097  0.0789

The margins library comes with several plot methods of its own. If you wish, at this point you can just try plot(bo_m) to see a plot of the average marginal effects, produced with the general look of a Stata graphic. Other plot methods in the margins library include cplot(), which visualizes marginal effects conditional on a second variable, and image(), which shows predictions or marginal effects as a filled heatmap or contour plot.

Alternatively, we can take results from margins() and plot them ourselves. To clean up the summary a little a little, we convert it to a data frame, then use prefix_strip() and prefix_replace() to tidy the labels. We want to strip the polviews_m and sex prefixes, and (to avoid ambiguity about “Other”), adjust the race prefix.

bo_gg <- data.frame(summary(bo_m))
prefixes <- c("polviews_m", "sex")
bo_gg$factor <- prefix_strip(bo_gg$factor, prefixes)
bo_gg$factor <- prefix_replace(bo_gg$factor, "race", "Race: ")

bo_gg %>% select(factor, AME, lower, upper) 

##                   factor     AME    lower   upper
## 1           Conservative -0.4119 -0.46741 -0.3564
## 2 Extremely Conservative -0.4538 -0.53614 -0.3714
## 3      Extremely Liberal  0.2681  0.21032  0.3258
## 4                Liberal  0.2768  0.23190  0.3218
## 5  Slightly Conservative -0.2658 -0.33042 -0.2011
## 6       Slightly Liberal  0.1933  0.13400  0.2526
## 7            Race: Black  0.4032  0.36939  0.4371
## 8            Race: Other  0.1247  0.04904  0.2005
## 9                 Female  0.0443  0.00967  0.0789

Now we have a table that we can plot as we have learned:

p <- ggplot(data = bo_gg, aes(x = reorder(factor, AME),
                              y = AME, ymin = lower, ymax = upper))

p + geom_hline(yintercept = 0, color = "gray80") +
    geom_pointrange() + coord_flip() +
    labs(x = NULL, y = "Average Marginal Effect") 

If we are just interested in getting conditional effects for a particular variable, then conveniently we can ask the plot methods in the margins library to do the work calculating effects for us but without drawing their plot. Instead, they can return the results in a format we can easily use in ggplot, and with less need for clean up, for the clean-up. For example, with cplot():

pv_cp <- cplot(out_bo, x = "sex", draw = FALSE)

p <- ggplot(data = pv_cp, aes(x = reorder(xvals, yvals),
                              y = yvals, ymin = lower, ymax = upper))

p + geom_hline(yintercept = 0, color = "gray80") +
    geom_pointrange() + coord_flip() +
    labs(x = NULL, y = "Conditional Effect") 

The margins package is under active development. It can do much more than described here. The vignettes that come with the package provide more extensive discussion and numerous examples.

6.8 Plots from Complex Surveys

Social scientists often work with data collected using a complex survey design. Survey instruments may be stratified by region or some other characteristic, contain replicate weights to make them comparable to a reference population, have a clustered structure, and so on. In Chapter 4 we learned how calculate and then plot basic frequency tables of categorical variables, using some data from the General Social Survey (GSS). However, if we want accurate estimates of US households from the GSS, we will need to take the survey’s design into account, and use the survey weights provided in the dataset. Thomas Lumley’s survey library provides a comprehensive set of tools for addressing these issues. The tools and the theory behind them are discussed in detail in Lumley (2010), and an overview of the package is provided in Lumley (2004). While the functions in the survey package are straightforward to use and return results in a generally tidy form, the package predates the tidyverse and its conventions by several years. This means we cannot use survey functions directly with dplyr. However, Greg Freedman Ellis has written a helper package, srvyr, that solves this problem for us, and lets us use the survey library’s functions within a data analysis pipeline in a familiar way.

For example, the gss_lon data contains a small subset of measures from every wave of the GSS since its inception in 1972. It also contains several variables that describe the design of the survey and provide replicate weights for observations in various years. These technical details are described in the GSS documentation. Similar information is typically provided by other complex surveys. Here we will use this design information to calculate weighted estimates of the distribution of educational attainment by race, for selected survey years from 1976 to 2016.

To begin, we load the survey and srvyr libraries.


Next, we take our gss_lon dataset and use the survey tools to create a new object that contains the data, as before, but with some additional information about the survey’s design:

options(survey.lonely.psu = "adjust")

gss_wt <- subset(gss_lon, year > 1974) %>%
    mutate(stratvar = interaction(year, vstrat)) %>%
    as_survey_design(ids = vpsu,
                     strata = stratvar,
                     weights = wtssall,
                     nest = TRUE)

The two options set at the beginning provide some information to the survey library about how to behave. You should consult Lumley (2010) and the survey package documentation for details. The subsequent operations create gss_wt, an object with one additional column (stratvar), describing the yearly sampling strata. We use the interaction() function to do this. It multiplies the vstrat variable by the year variable to get a vector of stratum information for each year.We have to do this because of the way the GSS codes its stratum information. In the next step, we use the as_survey_design() function to add the key pieces of information about the survey design. It adds information about the sampling identifiers (ids), the strata (strata), and the replicate weights (weights). With those in place we can take advantage of a large number of specialized functions in the survey library that allow us to calculate properly weighted survey means or estimate models with the correct sampling specification. For example, we can easily calculate the distribution of education by race for a series of years from 1976 to 2016. We use survey_mean() to do this:

out_grp <- gss_wt %>%
    filter(year %in% seq(1976, 2016, by = 4)) %>%
    group_by(year, race, degree) %>%
    summarize(prop = survey_mean(na.rm = TRUE))


## # A tibble: 150 x 5
##     year   race         degree   prop prop_se
##    <dbl> <fctr>         <fctr>  <dbl>   <dbl>
##  1  1976  White Lt High School 0.3283 0.01595
##  2  1976  Black Lt High School 0.5620 0.06112
##  3  1976  Other Lt High School 0.2500 0.15167
##  4  1976  White    High School 0.5183 0.01617
##  5  1976  Black    High School 0.3372 0.04759
##  6  1976  Other    High School 0.4167 0.16739
##  7  1976  White Junior College 0.0129 0.00298
##  8  1976  Black Junior College 0.0426 0.01931
##  9  1976  Other Junior College 0.0000 0.00000
## 10  1976  White       Bachelor 0.1012 0.00960
## # ... with 140 more rows

Notice that the results returned in out_grp include standard errors. We can also ask survey_mean() to calculate confidence intervals for us, if we wish.

Grouping with group_by() lets us calculate counts or means for the innermost variable, grouped by the next variable up—in this case, degree by race, such that the proportions for degree will sum to one for each group in race, and this will be done separately for each value of year. If we want the marginal frequencies, such that the values for all combinations of race and degree sum to one within each year, we first have to interact the variables we are cross-classifying. Then we group by the new interacted variable and do the calculation as before:

out_mrg <- gss_wt %>%
    filter(year %in% seq(1976, 2016, by = 4)) %>%
    mutate(racedeg = interaction(race, degree)) %>%
    group_by(year, racedeg) %>%
    summarize(prop = survey_mean(na.rm = TRUE))


## # A tibble: 150 x 4
##     year              racedeg    prop prop_se
##    <dbl>               <fctr>   <dbl>   <dbl>
##  1  1976 White.Lt High School 0.29823 0.01463
##  2  1976 Black.Lt High School 0.04711 0.00840
##  3  1976 Other.Lt High School 0.00195 0.00138
##  4  1976    White.High School 0.47078 0.01597
##  5  1976    Black.High School 0.02826 0.00594
##  6  1976    Other.High School 0.00325 0.00166
##  7  1976 White.Junior College 0.01170 0.00268
##  8  1976 Black.Junior College 0.00357 0.00162
##  9  1976 Other.Junior College 0.00000 0.00000
## 10  1976       White.Bachelor 0.09194 0.00888
## # ... with 140 more rows

This gives us the numbers that we want and returns them in a tidy data frame. As you can see, the interaction() function produces variable labels that are a compound of the two variables we interacted, with each combination of categories separated by a period, (such as White.Graduate. However, perhaps we would like to see these categories as two separate columns, one for race and one for education, as before. Because the variable labels are organized in a predictable way, we can use one of the convenient functions in the tidyverse’s tidyr library to separate the single variable into two columns while correctly preserving the row values. Intuitively, this function is called separate().

out_mrg <- gss_wt %>%
    filter(year %in% seq(1976, 2016, by = 4)) %>%
    mutate(racedeg = interaction(race, degree)) %>%
    group_by(year, racedeg) %>%
    summarize(prop = survey_mean(na.rm = TRUE)) %>%
    separate(racedeg, sep = "\\.", into = c("race", "degree"))


## # A tibble: 150 x 5
##     year  race         degree    prop prop_se
##  * <dbl> <chr>          <chr>   <dbl>   <dbl>
##  1  1976 White Lt High School 0.29823 0.01463
##  2  1976 Black Lt High School 0.04711 0.00840
##  3  1976 Other Lt High School 0.00195 0.00138
##  4  1976 White    High School 0.47078 0.01597
##  5  1976 Black    High School 0.02826 0.00594
##  6  1976 Other    High School 0.00325 0.00166
##  7  1976 White Junior College 0.01170 0.00268
##  8  1976 Black Junior College 0.00357 0.00162
##  9  1976 Other Junior College 0.00000 0.00000
## 10  1976 White       Bachelor 0.09194 0.00888
## # ... with 140 more rows

TheThe two backslashes before the period are necessary for R to interpret it literally as a period. By default in search and replace operations like this, the search terms are regular expressions. The period acts as a special character, a kind of wildcard, meaning ‘any character at all’. To make the regular expression engine treat it literally, we add one backslash before it. The backslash is an ‘escape’ character. It means ‘The next character is going to be treated differently from usual’. However, because the backslash is a special character as well, we need to add a second backslash to make sure the parser sees it properly. call to separate() says to take the racedeg column, split each value when it sees a period, and reorganize the results into two columns, race and degree. This gives us a tidy table much like out_grp, but for the marginal frequencies.

Reasonable people can disagree over how best to plot a small multiple of a frequency table while faceting by year, especially when there is some measure of uncertainty attached. A barplot is the obvious approach for a single case, but when there are many years it can become difficult to compare bars across panels. This is especially the case when standard errors or confidence intervals are used in conjunction with bars. This is sometimes called a ‘dynamite plot’, not because it looks amazing but because the t-shaped error designators on the tops of the columns make them look like cartoon dynamite detonators. An alternative is to use a line graph to join categories. This is defensible but not ideal, because in general it is preferable to show that the underlying variable is categorical (as a bar chart makes clear) and not continuous (as a line graph more easily suggests). Figure 6.14 shows the results for our GSS data in dynamite-plot form, where the error bars are defined as twice the standard error in either direction around the point estimate.

p <- ggplot(data = subset(out_grp, race %nin% "Other"),
            mapping = aes(x = degree, y = prop,
                          ymin = prop - 2*prop_se,
                          ymax = prop + 2*prop_se,
                          fill = race,
                          color = race,
                          group = race))

dodge <- position_dodge(width=0.9)

p + geom_col(position = dodge, alpha = 0.2) +
    geom_errorbar(position = dodge, width = 0.2) +
    scale_x_discrete(labels = scales::wrap_format(10)) +
    scale_y_continuous(labels = scales::percent) +
    scale_color_brewer(type = "qual", palette = "Dark2") +
    scale_fill_brewer(type = "qual", palette = "Dark2") +
    labs(title = "Educational Attainment by Race",
         subtitle = "GSS 1976-2016",
         fill = "Race",
         color = "Race",
         x = NULL, y = "Percent") +
    facet_wrap(~ year, ncol = 2) +
    theme(legend.position = "top")

Figure 6.14: Weighted estimates of educational attainment for Whites and Blacks, GSS selected years 1976-2016. Faceting barplots in this way is often not ideal, especially when multiple comparisons are being made.

Weighted estimates of educational attainment for Whites and Blacks, GSS selected years 1976-2016. Faceting barplots in this way is often not ideal, especially when multiple comparisons are being made.

This plot has a few cosmetic details and adjustments that we will learn more about in Chapter 8. As before, I encourage you to peel back the plot from the bottom, one instruction at a time, to see what changes. One useful adjustment to notice is the new call to the scales library to adjust the labels on the x-axis. The adjustment on the y-axis is familiar, scales::percent to convert the proportion to a percentage. On the x-axis, the issue is that several of the labels are rather long. If we do not adjust them they will print over one another. The scales::wrap_format() function will break long labels into lines. It takes a single numerical argument (here 10) that is the maxmimum length a string can be before it is wrapped onto a new line.

A graph like this is true to the categorical nature of the data, while showing the breakdown of groups within each year. But you should experiment with some alternatives. For example, we might decide that it is better to facet by degree category instead, and put the year on the x-axis within each panel. If we do that, then we can use geom_line() to show a time trend, which is more natural, and geom_ribbon() to show the error range.

p <- ggplot(data = subset(out_grp, race %nin% "Other"),
            mapping = aes(x = year, y = prop,
                          ymin = prop - 2*prop_se,
                          ymax = prop + 2*prop_se,
                          fill = race,
                          color = race,
                          group = race))

p + geom_ribbon(alpha = 0.3, aes(color = NULL)) +
    geom_line() + 
    facet_wrap(~ degree, ncol = 3) +
    scale_y_continuous(labels = scales::percent) +
    scale_color_brewer(type = "qual", palette = "Dark2") +
    scale_fill_brewer(type = "qual", palette = "Dark2") +
    labs(title = "Educational Attainment by Race",
         subtitle = "GSS 1976-2016",
         fill = "Race",
         color = "Race",
         x = NULL, y = "Percent") +
    theme(legend.position = "top")

A second, and better, version of our GSS attainment plot.

Figure 6.15: A second, and better, version of our GSS attainment plot.

This perhaps a better way to show the data, especially as it brings out the time trends within each degree category, and allows us to see the similarities and differences by racial classification at the same time.

6.9 Other Tools to Know About

6.9.1 Base graphics and ggfortify

A wide variety of other tools are available in R and ggplot to help us visually explore datasets and the models generated from them. Bear in mind that just as model objects in R usually have a default summary() method, model objects will usually have a default plot() method as well. The figures they produce are typically not generated via ggplot, but it is always worth exploring them. Instead of ggplot, their methods often make use of R’s base graphics or alternatively the lattice library. These are two plotting systems that we do not cover in this book. Default plot methods are easy to examine, however. Let’s take a look again at our simple OLS model.

out <- lm(formula = lifeExp ~ gdpPercap + pop + continent, data = gapminder)

To look at R’s default plots for this model, use the plot() function.

## Plot not shown
plot(out, which = c(1,2), ask=FALSE)

The which() statement here selects the first two of four default plots for this kind of model. If you want to easily reproduce base R’s default model graphics using ggplot, the ggfortify library is worth examining. It is in some ways similar to broom, but focuses on generating a standard plot or group of plots for a wide variety of model objects. It does this by defining a function called autoplot(). The idea is to be able to use autoplot() the output of many different kinds of model.

6.9.2 The coefplot library

A second option worth looking at is the coefplot library. It provides a quick way to produce good-quality plots of point estimates and confidence intervals. It has the advantage of managing the estimation of interaction effects and other occasionally tricky calculations.

A plot from coefplot. Figure 6.16: A plot from coefplot.

out <- lm(formula = lifeExp ~ gdpPercap + pop + continent, data = gapminder)

coefplot(out, sort = "magnitude", intercept = FALSE)

6.9.3 Extensions to ggplot

The GGally package provides a suite of functions designed to make producing standard but somewhat complex plots a little easier. For instance, it can produce generalized pairs plots, a useful way of quickly examining possible relationships between several different variables at once. This sort of plot is like the visual version of a correlation matrix. It shows a bivariate plot for all pairs of variables in the data. This is relatively straightforward when all the variables are continuous measures. Things get more complex when—as is often the case in the social sciences—some or all variables are categorical or otherwise limited in the range of values they can take. A generalized pairs plot can handle these cases. For example, Figure 6.17 shows a generalized pairs plot for five variables from the organdata dataset.


organdata_sm <- organdata %>%
    select(donors, pop.dens, pubhealth,
           roads, consent.law)

ggpairs(data = organdata_sm,
        mapping = aes(color = consent.law)) 

A generalized pairs plot made using the GGally library.

Figure 6.17: A generalized pairs plot made using the GGally library.

Multi-panel plots like this are intrinsically very rich in information. When combined with several within-panel types of representation, or any more than a modest number of variables, they can become quite complex. They should be used less for the presentation of finished work—although this is possible—and more as a tool for the working researcher to quickly investigate aspects of a data set. The goal is not to pithily summarize a single point one already knows, but to open things up for further exploration.

In general, when you estimate models and want to extract estimates from them in order to plot, the difficult step is not the plotting but rather the correct calculation of the estimates. Generating predicted values and (especially) estimates of confidence or uncertainty from models is often somewhat tricky, especially when the model involves interactions, cross-level effects, or transformations of the predictor or response scales. These details, which vary substantially from model type to model type, and also with the goals of any particular analysis, cannot be written down in advance in cookbook form. But in all of these cases, the goal should be to produce a tidy table of data to work with. From there, ggplot will be able to present it effectively even if the underlying structure is complex.