This article continues an exploration of the Census Trade data sets, but with an eye for doing larger "bulk-loading" of data the Census makes available. Specifically, we will:
- Retrieve export data from the US Census by using their Application Programming Interface (API),
- Import the API-generated data file into a database using Microsoft SQL Server Integration Services.
In a previous article, I described a method of generating export data using the Census USA Trade Online website. That method is great for initial discovery and extraction of one-time data. But if you are ingesting Census data on a recurring schedule - you need a process you can repeat easily and consistently. The Census API is useful for this purpose.
The key point is -- if you are retrieving Census Trade data you have two possible approaches. Here is a recap of their differences.
USA Trade Online
- This is a graphical user-interface (GUI) method of retrieving. You make a series of menu selections that define the data format and data filters.
- The website GUI makes data retrieval more accessible to non-technical users.
- This interface has a greater range of historical data. It has monthly data back to 2002, versus 2013 for the API.
- USA Trade Online generates a comma-separated value (CSV) file that is familiar to most users.
- Data retrieval is achieved by constructing a single "command-line" of text that defines columns and data filters.
- This approach requires greater familiarity with data specifications. You will need to wade thru reference documents to understand how to retrieve your data.
- After you've retrieved the data once, future retrievals are simple and reliable. It involves minor edits to your text "command-line". This is in contrast with the GUI approach used by USA Trade Online where one wrong mouse click can create an entirely different type of file.
- The API creates a "JSON" file format used by automated import tools. This format is not directly readable by end-user tools like Microsoft Excel.
There is one other distinction. Both the USA Trade Online and Census API approaches work fine with a browser. But with the API you have an additional option. You can use a program like Curl to batch a whole series of data requests. This gives the API a distinct advantage over the GUI. You can load a large batch file of data requests, start the routine and step away while the batch file runs. Achieving the same type of bulk download using the USA Trade Online GUI is a hurry-up-and-wait process. You request the data, wait 5 minutes for the job to complete and then repeat the process. It may sound inconsequential, but if you are trying to process dozens of data files - the USA Trade Online interface is relatively tedious and prone to user error.
What does an API request look like?
For demonstration, let's look at the command line statement needed to retrieve October 2018 export data for Georgia. We want exports classified with 6-digits of precision using the Harmonized Tariff Codes. The statement needed to retrieve that data is:
curl "https://api.census.gov/data/timeseries/intltrade/exports/statehs ?get=SUMMARY_LVL,SUMMARY_LVL2,E_COMMODITY,CTY_CODE,CTY_NAME,CNT_VAL_MO &YEAR=2018&MONTH=10&COMM_LVL=HS6&STATE=GA" -o B:/temp/curl/exports_2018-10.json
When you author your API retrieval, it should all be a single line in a text file. For purposes of readability I've broken this into five separate lines. Let's evaluate each line.
- Curl: This is an open source utility useful for retrieving web data. See resource links for this and other items at bottom.
- HTTPS address: This is one of multiple API endpoints that the Census makes available. This particular one is dedicated to state-level exports that sum values over a period of time ("time-series").
- GET: This is a list of columns you want retrieved from the API data source. The order you list them here will be the order the API will save them in your file.
- FILTER Parameters: Starting with "&YEAR", we tell the API what portion of the Census data we want. This example filters for October 2018, showing harmonized codes to 6 digits ("HS6") for the state of Georgia only. If you use a column as a filter, it is automatically included in the data export. You do not need to include those filtered columns as part of the GET statement.
- OUTPUT: The "-o" flag indicates that the text to follow is the location and file name you want the API to use when generating the local data file.
You would use a simple text editor like Notepad to author this. Saving the file with a .BAT extension will allow you to run it. If you want to retrieve a whole year's worth of data, you should structure your batch file into individual month retrievals (each on their own line in the BAT file.) This will ensure the API call does not time-out. It will also give you more manageable data files to work with.
Shown here is a sample of the JSON file created. It's format somewhat resembles a comma-separated data file. It takes between 3-5 minutes for the Census API to generate a month's worth of data.
Now that we have the raw data in JSON format, let's join this most recent months of data into the larger database we are populating in SQL Server. To do that, we'll use a Microsoft tool called SQL Server Integration Services (SSIS).
Importing to SQL Server using SSIS
SSIS is a powerful and flexible tool useful for all manner of data "munging". An in-depth explanation of it's features goes beyond the scope of this article. My purpose here is to show the workflow of retrieving the data from the Census and loading SQL Server.
The SSIS import process is defined as a Visual Studio package. It is designed to process multiple export and import data files each time it runs. It does the following:
- A Foreach Loop enumerates each file in the API destination folder. In the graphic below, you can see the properties sheet reference "B:\curl" and "exports*.json".
- Each data file is then used as part of a data flow. That step maps columns between the JSON flat file and the SQL Server data table.
- An intermediate step called "derived column" calculates the month and year using the format YYYYMM.
This morning, I've processed three months of 2018 data files. This added 192,030 records to our destination "Export" table. All together then we have a 5 year total of 5,302,270 rows of export data. This paves the way for our next article -- analyzing export data trends in Georgia.
Supporting Links / Resources
Census API documentation
- Census Data API User Guide (PDF): https://www.census.gov/data/developers/guidance/api-user-guide.Available_Data.html
- List of available datasets: https://api.census.gov/data.html
- Time Series datasets: https://www.census.gov/data/developers/data-sets/international-trade.html
USA Trade Online website
- Curl program download and documentation: https://curl.haxx.se/docs/
- Curl examples: https://jonlabelle.com/snippets/view/shell/curl-command
- Curl primer: http://www.compciv.org/recipes/cli/downloading-with-curl/