This document walks you through how to access the Census Bureau’s API and find key demographic data.
These instructions rely heavily on the censusapi package. You will need to load it into your R session before proceeding.
library(censusapi)
If you have not worked with this package before, you’ll need to install it first.
install.packages("censusapi")
library(censusapi)
You’ll need an API key to query data from the Census Bureau. The Census Bureau uses this key to attribute an API call to you or your organization.
To get an API key, visit the following URL:
https://api.census.gov/data/key_signup.html
This link will ask you for an organization name and email. I simply put my own name as the organization name and used my email address. You can do the same using your name and email.
You should receive an email afterwards with your API key.
In order for the following functions to work, you’ll need to store your API key as an environment variable. Run the script below to set the key for your current R session, then print it back to confirm it was stored.
Sys.setenv(CENSUS_KEY = "insert your key here")
Sys.getenv("CENSUS_KEY")
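If you would rather catch a missing key early than hit a confusing error later, a small check like the one below may help. This is only a sketch: census_key_is_set() is a hypothetical helper name, not a function from censusapi.

```r
# Hypothetical helper (not part of censusapi): returns TRUE when the named
# environment variable holds a non-empty value.
census_key_is_set <- function(var = "CENSUS_KEY") {
  nzchar(Sys.getenv(var))
}

# Print a reminder instead of failing later inside an API call.
if (!census_key_is_set()) {
  message("CENSUS_KEY is not set. Run Sys.setenv(CENSUS_KEY = \"...\") first.")
}
```

Sys.getenv() returns an empty string for an unset variable, which is why nzchar() is enough for the check.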
The Census Bureau has a lot of APIs. It’s okay if you don’t know which one you’ll need yet. We can get started by seeing which ones are available using the listCensusApis() function.
apis <- listCensusApis()
View(apis)
This should open a large table listing all the APIs with a brief description of each. Feel free to browse through them to get an idea of what they cover.
As you can tell, there are a lot of Census APIs, and their data overlaps quite a bit. How do you narrow it down to the one you need?
It depends on the type of data you want.
When it comes to the Census Bureau, most people think about the “Decennial Census” that takes place every ten years. Those APIs have the letters “dec” in the name column of the apis data set we created.
However, those data sets only include simple demographic data, such as a count of residents in an area broken down by race, sex, and a few other simple descriptors.
The “American Community Survey” is a more in-depth survey conducted by the Census Bureau. Its APIs have the letters “acs” in the name column of the apis data set we created.
The American Community Survey (ACS) asks a broader range of questions that reflect the diversity of living standards and demographics of people within the United States.
A trade-off with the ACS is that it is based on a smaller sample. These surveys take a lot more time to collect from a single household, so the Census Bureau is forced to rely on fewer people to make population inferences.
Don’t worry though. The Census Bureau still gets enough of these responses to make a reasonable estimate of how the overall population is doing. While these estimates will never be exact, we can place more confidence in them than in most publicly available survey data out there. Plus, this data is free and available to the public. You don’t have to pay a market research company to gather this information!
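To narrow the API list to the decennial or ACS entries described above, you can filter on the name column. The sketch below uses a small hand-made stand-in for the real table so it runs without an API call; filter_apis() and apis_example are hypothetical names, and the real table returned by listCensusApis() has many more rows and columns.

```r
# Toy stand-in for the table returned by listCensusApis(); only the
# `name` column matters here and these few rows are illustrative.
apis_example <- data.frame(
  name = c("dec/sf1", "acs/acs1", "acs/acs5", "acs/acs5/profile",
           "timeseries/poverty/saipe"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows whose API name contains the given text.
filter_apis <- function(apis, text) {
  apis[grepl(text, apis$name, fixed = TRUE), , drop = FALSE]
}

filter_apis(apis_example, "acs")$name  # the three ACS entries
filter_apis(apis_example, "dec")$name  # the decennial entry
```

With the real data you would pass the apis object created earlier instead of apis_example.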
The Census Bureau releases 1-year estimates and 5-year estimates for the ACS. The 1-year estimates cover geographic areas with populations of 65,000 or more, and they are released on a more timely basis. The 5-year estimates pool data collected over a five-year period, which lets them report on much smaller areas (such as zip codes) across the country. That makes the 5-year API a better choice for rural communities.
This document from the Census Bureau website does a better job at explaining the difference.
To quote this resource, “Multi-year estimates should be labeled to indicate clearly the full period of time (e.g., ‘The child poverty rate in 2014–2018 was X percent.’). They do not describe any specific day, month, or year within that time period.”
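As a small illustration of that labeling advice, the sketch below builds the period label for a 5-year estimate. It assumes the vintage is the final year of the five-year window (so vintage 2019 covers 2015 through 2019); acs5_period_label() is a hypothetical helper, not part of censusapi.

```r
# Hypothetical helper: label the collection period of an ACS 5-year
# estimate, assuming the vintage is the last year of the period.
acs5_period_label <- function(vintage) {
  end_year <- as.integer(vintage)
  paste0(end_year - 4L, "-", end_year)
}

acs5_period_label("2019")  # "2015-2019"
```

You could paste this label into chart titles or column names so readers do not mistake a 5-year estimate for a single-year figure.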
Let’s say we want to view 5-year estimates. Chances are, we want to pull data relating to demographic “profiles.” (I’ll explain why that’s what we want in a moment.) So we’ll use “acs/acs5/profile” for the API name. At the time of writing, the most recent year on record for this API is 2019.
That’s the API we want to pull from, but we don’t know what data is available within it yet. Typically, the variables are organized into “groups.” So in order to find those variables, we have to determine which groups we want first. That’s what the listCensusMetadata() function is for.
If you run the script below, it’ll pull the variable groupings for the API “acs/acs5/profile” for the year “2019”.
census_groups <- listCensusMetadata(
  name = "acs/acs5/profile",
  vintage = "2019",
  type = "groups")
View(census_groups)
Fortunately, this API only has five groups. If you had used “acs/acs5” for your API name, you would’ve seen far more options. There’s actually not much difference between the data in these APIs though. It’s simply a matter of how detailed you want the groupings to be. In my experience, “acs/acs5/profile” is an easier way to narrow down to the variables you want.
## # A tibble: 5 × 3
##   name   description                                          variables
##   <chr>  <chr>                                                <chr>
## 1 DP04   SELECTED HOUSING CHARACTERISTICS                     https://api.censu…
## 2 DP05   ACS DEMOGRAPHIC AND HOUSING ESTIMATES                https://api.censu…
## 3 DP02PR SELECTED SOCIAL CHARACTERISTICS IN PUERTO RICO       https://api.censu…
## 4 DP02   SELECTED SOCIAL CHARACTERISTICS IN THE UNITED STATES https://api.censu…
## 5 DP03   SELECTED ECONOMIC CHARACTERISTICS                    https://api.censu…
Let’s say I want to find median income for the zip codes in the state of Kansas. Using the list of groups we just pulled, I would suspect the DP03 group will have the data I want. I need to find out what geographies are available first though. I would use the same listCensusMetadata() function, but change the type argument to “geography”. I would also need to set the group argument to “DP03”.
census_geo <- listCensusMetadata(
  name = "acs/acs5/profile",
  vintage = "2019",
  group = "DP03",
  type = "geography")
View(census_geo)
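As you can see, we have several geographic levels to choose from. For my research question, I want to pull data at the state and zip code level. Both are available for this API and group. Zip codes are listed under the name “zip code tabulation area”, which falls among the rows not shown in the preview.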
## # A tibble: 53 × 6
##    name        geoLevelDisplay referenceDate requires  wildcard optionalWithWCF…
##    <chr>       <chr>           <chr>         <list>    <list>   <chr>
##  1 us          010             2019-01-01    <NULL>    <NULL>   <NA>
##  2 region      020             2019-01-01    <NULL>    <NULL>   <NA>
##  3 division    030             2019-01-01    <NULL>    <NULL>   <NA>
##  4 state       040             2019-01-01    <NULL>    <NULL>   <NA>
##  5 county      050             2019-01-01    <chr [1]> <chr [1… state
##  6 county sub… 060             2019-01-01    <chr [2]> <chr [1… county
##  7 subminor c… 067             2019-01-01    <chr [3]> <NULL>   <NA>
##  8 tract       140             2019-01-01    <chr [2]> <chr [1… county
##  9 place       160             2019-01-01    <chr [1]> <chr [1… state
## 10 consolidat… 170             2019-01-01    <chr [1]> <chr [1… state
## # … with 43 more rows
Now that I have settled on my group and geography, I can see what variables are available. I would use the same listCensusMetadata() function, but change the type argument to “variables”.
census_var <- listCensusMetadata(
  name = "acs/acs5/profile",
  vintage = "2019",
  group = "DP03",
  type = "variables")
View(census_var)
This will generate a list of variables to query. You can filter the label column to find “median household income”. The variable names we need are DP03_0062E and DP03_0062M, which represent the estimate and its margin of error.
## # A tibble: 2 × 7
##   name       label            concept    predicateType group limit predicateOnly
##   <chr>      <chr>            <chr>      <chr>         <chr> <chr> <chr>
## 1 DP03_0062E Estimate!!INCOM… SELECTED … int           DP03  0     TRUE
## 2 DP03_0062M Margin of Error… SELECTED … int           DP03  0     TRUE
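One way to do that label filtering in code is a case-insensitive text match. The sketch below runs on a hand-made stand-in table: the two DP03 names come from the output above, but the labels are shortened placeholders and the third row is made up. find_variables() and vars_example are hypothetical names.

```r
# Toy stand-in for the table returned by
# listCensusMetadata(type = "variables"). Labels are shortened
# placeholders; the EXAMPLE_0001E row is invented for contrast.
vars_example <- data.frame(
  name  = c("DP03_0062E", "DP03_0062M", "EXAMPLE_0001E"),
  label = c("Estimate!!Median household income (dollars)",
            "Margin of Error!!Median household income (dollars)",
            "Estimate!!Some other measure"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: case-insensitive search of the label column.
find_variables <- function(vars, text) {
  vars[grepl(text, vars$label, ignore.case = TRUE), c("name", "label")]
}

find_variables(vars_example, "median household income")
```

With the real metadata you would call find_variables(census_var, "median household income") instead.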
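Now we can finally query our data set. We’ll use the getCensus() function for this part. We’ll need to specify the API name, the vintage (year), the variable names, and the region we want. States are identified by FIPS codes. Note that the code 17 used in the query below is the FIPS code for Illinois; the code for Kansas is 20, so use regionin="state:20" to pull Kansas instead. (You can find these codes at this link)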
census <- getCensus(
  name = "acs/acs5/profile",
  vintage = "2019",
  vars = c("DP03_0062E", "DP03_0062M"),
  region = "zip code tabulation area:*",
  regionin = "state:17")
View(census)
And that’s it! You will have to rename these variables to something more legible, but that’s how you query Census data.
## # A tibble: 1,383 × 4
##    state zip_code_tabulation_area DP03_0062E DP03_0062M
##    <chr> <chr>                         <dbl>      <dbl>
##  1 17    60970                         42998       3405
##  2 17    62323                         58929       6859
##  3 17    60429                         54717       5842
##  4 17    60461                        100074      12770
##  5 17    60462                         84393       4548
##  6 17    60466                         54155       3736
##  7 17    60470                         80875      18803
##  8 17    60481                         61809       3429
##  9 17    60471                         62946       5789
## 10 17    60472                         27078       8120
## # … with 1,373 more rows
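As a rough sketch of that renaming step, the code below works on the first three rows of the output above, typed in by hand so it runs without an API call. The new column names are my own choices. Two assumptions are worth flagging: the API can return large negative placeholder values (such as -666666666) where an estimate is unavailable, so negatives are treated as missing here, and ACS margins of error are published at the 90 percent confidence level, so the low and high bounds describe a 90 percent interval.

```r
# First three rows of the query output, entered by hand for illustration.
census_example <- data.frame(
  state = c("17", "17", "17"),
  zip_code_tabulation_area = c("60970", "62323", "60429"),
  DP03_0062E = c(42998, 58929, 54717),
  DP03_0062M = c(3405, 6859, 5842),
  stringsAsFactors = FALSE
)

# Rename the Census variable codes to readable column names.
names(census_example)[names(census_example) == "DP03_0062E"] <- "median_income"
names(census_example)[names(census_example) == "DP03_0062M"] <- "median_income_moe"

# Assumption: negative dollar values are placeholders, not real data.
census_example$median_income[census_example$median_income < 0] <- NA
census_example$median_income_moe[census_example$median_income_moe < 0] <- NA

# Bounds implied by the margin of error (estimate minus/plus MOE).
census_example$income_low  <- census_example$median_income - census_example$median_income_moe
census_example$income_high <- census_example$median_income + census_example$median_income_moe

census_example
```

The same steps should apply unchanged to the full census data frame returned by getCensus().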