Complete 990 E-File Index

The IRS began releasing data from 990 electronic filers in 2016. The data can be challenging to access, though, because it lives in XML files on an Amazon server.

Most data extraction requires the user to identify the organizations of interest and then extract fields from the pertinent XML files.
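As a rough illustration of the extraction step, the sketch below pulls a few fields out of a single return using only the Python standard library. The XML fragment is a simplified, made-up stand-in for a real filing. The element names, the field paths, and the helper `extract_fields` are assumptions for illustration; real returns are far larger, and their element names vary by form type and schema version.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up fragment standing in for one e-filed return.
# Element names are assumptions; check them against the actual filing.
SAMPLE_XML = """
<Return xmlns="http://www.irs.gov/efile">
  <ReturnHeader>
    <TaxYr>2015</TaxYr>
    <Filer>
      <EIN>123456789</EIN>
    </Filer>
  </ReturnHeader>
  <ReturnData>
    <IRS990>
      <TotalRevenueAmt>250000</TotalRevenueAmt>
    </IRS990>
  </ReturnData>
</Return>
"""

# Namespace prefix used in the paths below.
NS = {"efile": "http://www.irs.gov/efile"}

# Hypothetical mapping of output names to element paths.
FIELDS = {
    "ein": "efile:ReturnHeader/efile:Filer/efile:EIN",
    "tax_year": "efile:ReturnHeader/efile:TaxYr",
    "total_revenue": "efile:ReturnData/efile:IRS990/efile:TotalRevenueAmt",
}


def extract_fields(xml_text, fields):
    """Return {name: text} for each path; None when the element is absent."""
    root = ET.fromstring(xml_text.strip())
    return {name: root.findtext(path, namespaces=NS) for name, path in fields.items()}


record = extract_fields(SAMPLE_XML, FIELDS)
print(record)
```

Values come back as strings, so numeric fields need explicit conversion, and a missing element yields `None` rather than raising an error.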

The 990 efile INDEX contains a list of all filers, their organizational ID (EIN), the forms they filed (990, 990EZ, 990PF), the date they were filed, and the URL of the XML file for each.
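To make the role of the index concrete, here is a minimal sketch of using it to find the XML files for a set of organizations. The column names (`EIN`, `FormType`, `DateFiled`, `URL`), the sample rows, the example.org URLs, and the helper `select_filings` are all hypothetical; the real index files use their own headers, so adjust the names to match.

```python
import csv
import io

# Made-up rows in the shape the index is described as having.
# Column names and URLs are placeholders, not the real index layout.
SAMPLE_INDEX = """EIN,FormType,DateFiled,URL
123456789,990,2016-03-01,https://example.org/filing_a.xml
123456789,990EZ,2017-04-15,https://example.org/filing_b.xml
987654321,990PF,2016-06-30,https://example.org/filing_c.xml
"""


def select_filings(index_text, eins, form_types=None):
    """Return index rows for the given EINs, optionally limited by form type."""
    wanted_eins = set(eins)
    wanted_forms = set(form_types) if form_types else None
    reader = csv.DictReader(io.StringIO(index_text))
    return [
        row
        for row in reader
        if row["EIN"] in wanted_eins
        and (wanted_forms is None or row["FormType"] in wanted_forms)
    ]


# EINs are kept as strings so that leading zeros are not lost.
matches = select_filings(SAMPLE_INDEX, {"123456789"}, {"990"})
urls = [row["URL"] for row in matches]
print(urls)
```

The resulting list of URLs is what a download step would loop over to fetch each organization's XML files.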

Unfortunately, some power users have noted that the official INDEX file on AWS does not include all of the available XML files and is updated inconsistently. David Bornstein at Open 990 has described the problem in detail:

Skip the IRS 990 Efile Indices

Simon Shachter, a PhD student at the University of Chicago, has extended Bornstein’s solution and archived a public Master Index file that is as complete as possible.

The data files can be downloaded from Dataverse:

https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/BYJAPN

And Simon’s code can be found on GitHub:

https://github.com/simonys/aws_990_full_file_index

Simon Shachter simonys@uchicago.edu


Name Parsing at Scale

A name parser has been developed to assist with processing large administrative datasets that often have human names reported in a single unstructured text field.