I wasn’t able to attend useR! 2016 last week. But thanks to Microsoft’s live streaming service, I was able to record a number of sessions and watch them over the holiday weekend. I am a frequent user of many of Hadley Wickham’s R packages, so of course I tuned in for his keynote. Hadley presented the following map, using historical data from the USAboundaries package. His point wasn’t about map making per se, but rather about “tidy” principles for dealing with complex data structures (such as shape files or statistical models). I’m on board with the #tidyverse. But I was also inspired by the map itself.
In honor of US Independence Day, I decided to build on Hadley’s example to make an animated map of US settlement. I don’t have many occasions to make maps as part of my current academic work. So I thought it would be fun to spend some time on the 4th of July learning to wrangle shape files in R.
Goals
My primary goal was to make two maps:
an animated map showing the settlement of US territories and the conversion of territories to states; and
an animated map showing the development of US counties.
My secondary goal was to focus the maps on the continental US, while still showing data from all regions. Repositioning / reprojecting Alaska and Hawaii is certainly nothing new. But for a newcomer to geospatial data, it was a data-munging hurdle that I needed to clear.
Dataset
The USAboundaries package contains state/territory-level data from September 3, 1783 to December 31, 2000. So the state-level map will be a bit truncated historically – skipping over the early settlement of US colonies, as well as the declaration of the 13 original colonies as free and independent states in 1776(!). Still, we’ll have a nice bit of data to play with.
The county-level data in the USAboundaries package dates back about a century and a half further than the state-level data, all the way to December 30, 1636. So this dataset will provide a more extended view of US settlement history.
Maps
Here are the final results of my holiday weekend map-making expedition. Annotated R code to reproduce these maps is below.
The state-level map shows the United States on January 1st, every year from 1784 to 2000. (I grew up in Oklahoma, but I’d forgotten that there was a short period just before Oklahoma became a state in 1907 during which part of the region was an organized incorporated US territory and part was unorganized. This animated map reminded me of that fact.)
The county-level map shows the United States on January 1st, every year from 1637 to 2000. One thing I like about this animation is that it highlights the degree of geographic separation during the early years of settlement.
R code
Data wrangling process
The heavy lifting is done by the get_boundaries() function defined below. This function uses the USAboundaries::us_boundaries() function to get (state- or county-level) shape files for a specified date, and then performs some additional processing to convert the shape file into a convenient format for plotting. The additional processing involves the following:
extract metadata associated with each region (e.g., whether the region is an unorganized territory, state, etc.)
project the coordinate system to Albers equal area
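To give a feel for what the projection step does to the coordinates, here is a minimal base-R sketch of the spherical Albers equal-area conic formulas. It is an illustration only: in practice the reprojection would normally be handled by a spatial package such as sp or sf. The function name albers_project(), the spherical-Earth simplification, and the parameters (standard parallels at 29.5°N and 45.5°N, origin at 37.5°N, 96°W) are assumptions chosen for the example, not settings taken from the maps above.

```r
# Spherical Albers equal-area conic projection (illustrative sketch).
# Assumptions: spherical Earth; standard parallels 29.5 and 45.5 degrees N;
# projection origin at 37.5 N, 96 W. These are example values only.
albers_project <- function(lon, lat,
                           lat1 = 29.5, lat2 = 45.5,
                           lat0 = 37.5, lon0 = -96,
                           radius = 6371) {  # mean Earth radius in km
  to_rad  <- pi / 180
  phi     <- lat  * to_rad
  lambda  <- lon  * to_rad
  phi1    <- lat1 * to_rad
  phi2    <- lat2 * to_rad
  phi0    <- lat0 * to_rad
  lambda0 <- lon0 * to_rad

  # Cone constant and helper term derived from the two standard parallels
  n <- (sin(phi1) + sin(phi2)) / 2
  C <- cos(phi1)^2 + 2 * n * sin(phi1)

  # Distance from the cone apex for each point and for the origin latitude
  rho  <- radius * sqrt(C - 2 * n * sin(phi))  / n
  rho0 <- radius * sqrt(C - 2 * n * sin(phi0)) / n

  # Angle around the cone, measured from the central meridian
  theta <- n * (lambda - lambda0)

  data.frame(x = rho * sin(theta), y = rho0 - rho * cos(theta))
}

# Projection origin, then approximate coordinates for Seattle and Boston
pts <- albers_project(lon = c(-96, -122.33, -71.06),
                      lat = c(37.5, 47.61, 42.36))
print(pts)
```

The origin maps to (0, 0), points west of the central meridian get negative x values, and points east of it get positive x values, all in kilometres.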
The create_timeline() function creates the data needed to draw an animated timeline on the graphs. The input to this function is a dataframe containing lat/long info about the map to be plotted (e.g., the dataframe created by a call to get_boundaries()). The basic idea is to divide the x-axis range of the map (i.e., min and max longitude) into evenly spaced intervals, one for each year (since we’re animating by year). The resulting sequence gives the successive positions of the moving timeline marker. We then determine the vertical placement of the timeline by taking the max latitude of the map and adding a bit of padding.
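The timeline idea can be sketched in a few lines of base R. This is a hypothetical re-creation rather than the original create_timeline(): the function name make_timeline(), the column names long and lat, and the 5% padding are all assumptions made for illustration.

```r
# Hypothetical sketch of the timeline data; not the original create_timeline().
# Assumes the map data frame has numeric columns `long` and `lat`.
make_timeline <- function(map_df, years, pad = 0.05) {
  x_range <- range(map_df$long)              # min and max longitude
  y_max   <- max(map_df$lat)                 # top edge of the map
  y_pad   <- pad * diff(range(map_df$lat))   # padding as a fraction of map height

  data.frame(
    year = years,
    # One evenly spaced x position per year, spanning the full map width
    x = seq(x_range[1], x_range[2], length.out = length(years)),
    # Constant height just above the map
    y = y_max + y_pad
  )
}

# Toy map extent standing in for the output of get_boundaries()
toy_map  <- data.frame(long = c(-125, -100, -67), lat = c(25, 40, 49))
timeline <- make_timeline(toy_map, years = 1784:2000)
head(timeline)
```

Each row gives the marker position for one frame, so the marker slides from the left edge of the map to the right edge as the animation steps through the years.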
With these two helper functions in hand, we can now get to the business of making the maps.
State-level map
County-level map
Final notes
The Rmd file used to create this report is available here.
Note that if you run this code on a standard laptop, it will likely take several (5-15) minutes to make each map due to the amount of data that is prepped + rendered. The county map is particularly slow because the data frame contains more than 11 million rows and the GIF contains more than 360 frames (one frame per year from 1637 to 2000).
My approach here (and in general) was to get the desired output first and optimize later, especially since I’m new to map making. If you just want to play around with the code and see immediate results, I would start by reducing the date range to a handful of years, which will dramatically speed up the data prep + rendering process.
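As a rough sketch of that shortcut, the snippet below builds the full yearly sequence of January 1st dates for the state-level map and then keeps only a short window. The variable names and the 1900 to 1910 window are assumptions chosen for illustration; any small range would do.

```r
# Full sequence of frame dates for the state-level map: January 1st of each year
all_dates <- seq(as.Date("1784-01-01"), as.Date("2000-01-01"), by = "year")

# Illustrative subset: keep only a short window while experimenting
test_dates <- all_dates[all_dates >= as.Date("1900-01-01") &
                        all_dates <= as.Date("1910-01-01")]

length(all_dates)   # number of frames in the full animation
length(test_dates)  # number of frames in the reduced run
```

Passing the shorter vector of dates through the same pipeline keeps the code path identical while cutting the number of frames to prepare and render.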
Also, I just learned about the new rmapshaper package, which might be one way to improve rendering speed at scale. This package performs topology-aware polygon simplification: it reduces the number of vertices in each polygon while keeping the shared boundaries between adjacent polygons aligned. Maybe I’ll try that out in a later post.