Episode 2: Query Census Data with R

Instructions

This document walks you through how to access the Census Bureau’s API and find key demographic data.

Install and Load Necessary Packages

These instructions rely heavily on the censusapi package. You will need to load it with library() before proceeding.

library(censusapi)

If you have not worked with this package before, you’ll need to install it first.

install.packages("censusapi")
library(censusapi)

Generate an API Key

You’ll need an API key to query data from the Census Bureau. The Census Bureau uses this key to attribute an API call to you or your organization.

To get an API key, visit the following URL:

https://api.census.gov/data/key_signup.html

This link will ask you for an organization name and email. I simply put my own name as the organization name and used my email address. You can do the same using your name and email.

You should receive an email afterwards with your API key.

Assign API Key as Environment Variable

For the following functions to work, you’ll need to assign your API key to an environment variable named CENSUS_KEY. Run the script below to set the key for your current R session and confirm that it was stored.

Sys.setenv(CENSUS_KEY= "insert your key here")
Sys.getenv("CENSUS_KEY")
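
Before moving on, it can help to confirm that the key is actually visible to R. The snippet below is a minimal sketch; has_census_key() is a hypothetical helper written for this walkthrough, not a function from the censusapi package.

```r
# Hypothetical helper (not part of censusapi): returns TRUE when the
# CENSUS_KEY environment variable is set to a non-empty string.
has_census_key <- function() {
  nzchar(Sys.getenv("CENSUS_KEY"))
}

# Print a reminder instead of failing later inside an API call.
if (!has_census_key()) {
  message("CENSUS_KEY is not set; run Sys.setenv(CENSUS_KEY = \"...\") first.")
}
```

Keep in mind that Sys.setenv() only lasts for the current R session. If you want the key to persist, one common approach is to add a CENSUS_KEY line to your ~/.Renviron file.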

View List of Census APIs

The Census has a lot of APIs. It’s okay if you don’t know which one you’ll need yet. We can get started by first seeing which ones are available using the listCensusApis() function.

apis <- listCensusApis()
View(apis)

This should open a large table listing all the APIs with a brief description of each. Feel free to browse through them to get an idea of what they cover.

Which Census API Has the Data I Need?

As you can tell, there are a lot of Census APIs, and their data overlaps quite a bit. How do you narrow it down to the one you need?

It depends on the type of data you want.

When it comes to the Census Bureau, most people think about the “Decennial Census” that takes place every ten years. Those APIs have the letters “dec” in the name column of the apis data set we created.

However, those data sets only include simple demographic data, such as a count of residents in an area broken down by race, sex, and a few other simple descriptors.

The “American Community Survey” is a more in-depth survey conducted by the Census. These are listed with the letters “acs” in the name column in the apis data set we created.

The American Community Survey (ACS) asks a broader range of questions that reflect the diversity of living standards and demographics of people within the United States.

A trade-off with the ACS is that it’s based on a smaller sample. These surveys take a lot more time to collect from a single household, so the Census Bureau is forced to rely on fewer people to make population inferences.

Don’t worry though. The Census Bureau still gets enough of these responses to make a reasonable estimate of how the overall population is doing. While these estimates will never be exact, we can place more confidence in them than in most publicly available survey data out there. Plus, this data is free and available to the public. You don’t have to pay a market research company to gather this information!
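
To narrow the list down to one survey family, you can filter the name column. The sketch below uses a small stand-in data frame (apis_demo, with a handful of illustrative rows) rather than the real apis table, so it runs without an API call. Swap in apis to try it on the full list.

```r
# Stand-in for the table returned by listCensusApis(); only the `name`
# column is reproduced here, with a few illustrative rows.
apis_demo <- data.frame(
  name = c("acs/acs1", "acs/acs5", "acs/acs5/profile", "dec/sf1", "pep/population"),
  stringsAsFactors = FALSE
)

# Keep APIs whose name starts with "acs" (American Community Survey).
acs_apis <- apis_demo[grepl("^acs", apis_demo$name), , drop = FALSE]

# Keep APIs whose name starts with "dec" (Decennial Census).
dec_apis <- apis_demo[grepl("^dec", apis_demo$name), , drop = FALSE]

acs_apis$name
dec_apis$name
```

Anchoring the pattern with ^ only matches names that begin with those letters, which is a slightly stricter test than looking for the letters anywhere in the name.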

The Two Types of American Community Survey APIs

The Census Bureau releases 1-year estimates and 5-year estimates for the ACS. The 1-year estimates focus on geographic areas with populations of 65,000 or more and are released on a more timely basis. The 5-year estimates include data collected over a five-year period, but they are available for much smaller geographic areas (such as ZIP codes). That makes the 5-year API a better choice for rural communities.

This document from the Census Bureau website does a better job at explaining the difference.

To quote this resource, “Multi-year estimates should be labeled to indicate clearly the full period of time (e.g., ‘The child poverty rate in 2014–2018 was X percent.’). They do not describe any specific day, month, or year within that time period.”
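
Following that labeling advice, a small helper can turn a vintage into the full period it covers. This is only a sketch: acs_period_label() is a hypothetical function written for this walkthrough, and it assumes the vintage is the final year of the period (so the 2019 5-year vintage covers 2015 through 2019).

```r
# Hypothetical helper (not part of censusapi): build a period label from
# the vintage, assuming the vintage is the last year of the period.
acs_period_label <- function(vintage, years = 5) {
  end_year <- as.integer(vintage)
  start_year <- end_year - years + 1
  if (years == 1) {
    as.character(end_year)
  } else {
    paste0(start_year, "-", end_year)
  }
}

acs_period_label("2019")             # 5-year estimates
acs_period_label("2019", years = 1)  # 1-year estimates
```

You could then paste the label into chart titles or column names so readers know an estimate covers a multi-year period.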

Previewing the API Groups

Let’s say I want to view 5-year estimates. Chances are, you want to pull data relating to demographic “profiles.” (I’ll explain why that’s what we want in a moment.) So we’ll use the “acs/acs5/profile” for the API name. The most recent year on record for this API is 2019.

Now that’s the API we want to pull, but we don’t know what data is available within it yet. Typically, the variables available are in “groups.” So in order to find those variables, we have to determine what groups we want first. That’s what the listCensusMetadata() function is for.

If you run the script below, it’ll pull the variable groupings for the api “acs/acs5/profile” for the year “2019”.

census_groups <- listCensusMetadata(
    name="acs/acs5/profile",
    vintage="2019",
    type="groups")
View(census_groups)

Fortunately, this API only has five groups. If you had used “acs/acs5” for your API name, you would’ve seen far more options. There’s actually not too much difference between the data in these APIs though. It’s simply a matter of how detailed you want these groupings to be. In my experience, “acs/acs5/profile” is an easier way to narrow down to the variables you want.

## # A tibble: 5 × 3
##   name   description                                          variables         
##   <chr>  <chr>                                                <chr>             
## 1 DP04   SELECTED HOUSING CHARACTERISTICS                     https://api.censu…
## 2 DP05   ACS DEMOGRAPHIC AND HOUSING ESTIMATES                https://api.censu…
## 3 DP02PR SELECTED SOCIAL CHARACTERISTICS IN PUERTO RICO       https://api.censu…
## 4 DP02   SELECTED SOCIAL CHARACTERISTICS IN THE UNITED STATES https://api.censu…
## 5 DP03   SELECTED ECONOMIC CHARACTERISTICS                    https://api.censu…

Previewing the API Geography

Let’s say I want to find median income for the ZIP codes in the state of Kansas. Using the list of groups we just pulled, I would suspect the DP03 group will have the data I want. I need to find out what geographies are available first though. I would use the same listCensusMetadata() function, but change the type argument to “geography”. I would also need to set the group argument to “DP03”.

census_geo <- listCensusMetadata(
    name="acs/acs5/profile",
    vintage="2019",
    group="DP03",
    type="geography")
View(census_geo)

As you can see, we have several geographic levels to choose from. For my research question, I want to pull data at the state and zip code level. Both are available for this API and group.

## # A tibble: 53 × 6
##    name        geoLevelDisplay referenceDate requires  wildcard optionalWithWCF…
##    <chr>       <chr>           <chr>         <list>    <list>   <chr>           
##  1 us          010             2019-01-01    <NULL>    <NULL>   <NA>            
##  2 region      020             2019-01-01    <NULL>    <NULL>   <NA>            
##  3 division    030             2019-01-01    <NULL>    <NULL>   <NA>            
##  4 state       040             2019-01-01    <NULL>    <NULL>   <NA>            
##  5 county      050             2019-01-01    <chr [1]> <chr [1… state           
##  6 county sub… 060             2019-01-01    <chr [2]> <chr [1… county          
##  7 subminor c… 067             2019-01-01    <chr [3]> <NULL>   <NA>            
##  8 tract       140             2019-01-01    <chr [2]> <chr [1… county          
##  9 place       160             2019-01-01    <chr [1]> <chr [1… state           
## 10 consolidat… 170             2019-01-01    <chr [1]> <chr [1… state           
## # … with 43 more rows

Previewing the API Variables

Now that I have determined my group and geography, I can see what variables are available. I would use the same listCensusMetadata() function, but change the type argument to “variables”.

census_var <- listCensusMetadata(
    name="acs/acs5/profile",
    vintage="2019",
    group="DP03",
    type="variables")
View(census_var)

This will generate a list of variables for me to query from. You can filter the label column to find “median household income”. The variable names we need are DP03_0062E and DP03_0062M, which represent the estimate and the margin of error, respectively.

## # A tibble: 2 × 7
##   name       label            concept    predicateType group limit predicateOnly
##   <chr>      <chr>            <chr>      <chr>         <chr> <chr> <chr>        
## 1 DP03_0062E Estimate!!INCOM… SELECTED … int           DP03  0     TRUE         
## 2 DP03_0062M Margin of Error… SELECTED … int           DP03  0     TRUE
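
Filtering the label column can also be done in code rather than by eye. The sketch below uses a stand-in data frame (census_var_demo) so it runs offline. Its labels are abbreviated placeholders, not the exact text returned by the API, and the third row is only there to show a non-matching variable.

```r
# Stand-in for census_var; labels are abbreviated placeholders,
# not the exact strings returned by listCensusMetadata().
census_var_demo <- data.frame(
  name = c("DP03_0062E", "DP03_0062M", "DP03_0001E"),
  label = c(
    "Estimate ... Median household income (dollars)",
    "Margin of Error ... Median household income (dollars)",
    "Estimate ... Population 16 years and over"
  ),
  stringsAsFactors = FALSE
)

# Case-insensitive search of the label column.
is_income <- grepl("median household income", census_var_demo$label,
                   ignore.case = TRUE)
income_vars <- census_var_demo[is_income, , drop = FALSE]

income_vars$name
```

Replacing census_var_demo with census_var should apply the same filter to the real variable list.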

Querying Our Data

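Now we can finally query our data set. We’ll use the getCensus() function for this part. We’ll need to specify the API name, the vintage or year, variable names, and the region we want. (Note that the state code used below, 17, is the FIPS code for Illinois, which is why the output lists Illinois ZIP codes. To query Kansas instead, set regionin to "state:20". You can find these codes at this link)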

census <-
  getCensus(
    name="acs/acs5/profile",
    vintage="2019",
    vars=c("DP03_0062E","DP03_0062M"),
    region="zip code tabulation area:*",
    regionin="state:17")
View(census)

And that’s it! You will have to rename these variables to something more legible, but that’s how you query Census data.

## # A tibble: 1,383 × 4
##    state zip_code_tabulation_area DP03_0062E DP03_0062M
##    <chr> <chr>                         <dbl>      <dbl>
##  1 17    60970                         42998       3405
##  2 17    62323                         58929       6859
##  3 17    60429                         54717       5842
##  4 17    60461                        100074      12770
##  5 17    60462                         84393       4548
##  6 17    60466                         54155       3736
##  7 17    60470                         80875      18803
##  8 17    60481                         61809       3429
##  9 17    60471                         62946       5789
## 10 17    60472                         27078       8120
## # … with 1,373 more rows
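
As a follow-up to the renaming step mentioned above, here is one way it might look in base R. The sketch rebuilds the first three rows of the output by hand (census_demo) so it runs without an API call. The new column names are my own choice, and the lower and upper bounds assume the margin of error is simply added to and subtracted from the estimate.

```r
# First three rows of the query output above, rebuilt by hand.
census_demo <- data.frame(
  state = c("17", "17", "17"),
  zip_code_tabulation_area = c("60970", "62323", "60429"),
  DP03_0062E = c(42998, 58929, 54717),
  DP03_0062M = c(3405, 6859, 5842),
  stringsAsFactors = FALSE
)

# Rename the coded columns to something more legible (names are illustrative).
names(census_demo)[names(census_demo) == "DP03_0062E"] <- "median_income"
names(census_demo)[names(census_demo) == "DP03_0062M"] <- "median_income_moe"

# Estimate minus/plus the margin of error gives a plausible range.
census_demo$income_lower <- census_demo$median_income - census_demo$median_income_moe
census_demo$income_upper <- census_demo$median_income + census_demo$median_income_moe

census_demo
```

The Census Bureau publishes ACS margins of error at the 90 percent confidence level, so the resulting range is a 90 percent interval around each estimate.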