R package for building a research database from IRS 990 nonprofit efiler tax returns.
The full set of tables is available in the DATA DICTIONARY.
The Master Concordance File provides the crosswalk architecture for moving from XML files to rectangular tables.
https://github.com/Nonprofit-Open-Data-Collective/irs-efile-master-concordance-file
Several dozen one-to-many tables exist on Form 990 (one unique 990 filing to a table with many entries, such as many board members serving a single nonprofit). Documentation about the XML table structure can be useful when attempting to parse XML nodes into well-behaved relational tables.
https://nonprofit-open-data-collective.github.io/efile-rdb-tables/
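As a rough illustration of the one-to-many structure described above, the sketch below links a filing table to a board-member table through a shared key. The table names, column names, and values are invented for the example and are not the package's actual schema.

```r
# Illustrative data only: column names and values are made up
# and do not reflect the package's actual table schema.
filings <- data.frame(
  ObjectId = c( "A001", "A002" ),
  OrgName  = c( "ORG ONE", "ORG TWO" ),
  stringsAsFactors = FALSE )

# one-to-many table: each filing can contribute several rows
board <- data.frame(
  ObjectId   = c( "A001", "A001", "A001", "A002" ),
  PersonName = c( "P1", "P2", "P3", "P4" ),
  stringsAsFactors = FALSE )

# the shared key links each board member back to exactly one filing
linked <- merge( filings, board, by = "ObjectId" )
print( linked )

# count board members per filing
members.per.filing <- table( board$ObjectId )
print( members.per.filing )
```

Keeping the repeated entries in their own table avoids duplicating filing-level fields on every board-member row.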
Update: The IRS is no longer hosting efile data on AWS. Files must be downloaded from the IRS site directly.
https://www.irs.gov/charities-non-profits/form-990-series-downloads
XML files are available at:
https://nccs-efile.s3.us-east-1.amazonaws.com/xml/
The new URLs will thus look like:
https://nccs-efile.s3.us-east-1.amazonaws.com/xml/201020793492001120_public.xml
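The URL pattern above can be sketched as a small helper. The function name `make_efile_url` is hypothetical and not part of the package, and the `<ObjectId>_public.xml` pattern is an assumption inferred from the single example URL.

```r
# Hypothetical helper (not part of irs990efile): build an XML URL from an
# object ID, assuming every file follows the <ObjectId>_public.xml pattern.
# Object IDs are kept as character strings because 18-digit IDs exceed the
# range that R numerics can represent exactly.
make_efile_url <- function( object.id )
{
  base.url <- "https://nccs-efile.s3.us-east-1.amazonaws.com/xml/"
  paste0( base.url, object.id, "_public.xml" )
}

url <- make_efile_url( "201020793492001120" )
print( url )
```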
Note that xmltools is not available on CRAN, so it must be installed from GitHub before installing the irs990efile package.
# install.packages( 'devtools' )
devtools::install_github( 'ultinomics/xmltools' )
devtools::install_github( 'nonprofit-open-data-collective/irs990efile' )
library( irs990efile )
library( dplyr )
### BUILD THE FULL DATABASE
### (note: this can take days)
### (test on a sample first)
# test on random sample of 10,000 cases
index <- tinyindex
build_database( index )
# build the full index from AWS (~3.4 million 990 & 990EZ filers)
index <- build_index( tax.years=2009:2020 )
build_database( index )
###
### WORKING WITH SPECIFIC TABLES OR SAMPLES
###
# pre-loaded demo index of 10,000 random efilers from AWS:
tinyindex %>%
  select( OrganizationName, EIN, TaxYear, FormType ) %>%
  head()
#   OrganizationName                             EIN        TaxYear  FormType
# 1 MASTOCYTOSIS SOCIETY INC                     521959601  2018     990
# 2 MCKINLEY III INC                             364165018  2018     990
# 3 REAL SERVICES INC                            351157606  2013     990
# 4 GREATER KANSAS CITY FRIENDS OF FISHER HOUSE  842359546  2019     990EZ
# 5 THUMBNAIL THEATER                            510563980  2014     990EZ
# 6 BERNARD M AND CARYL H SUSMAN FOUNDATION      208068788  2010     990PF
# index files from 2009 to 2020 are preloaded:
data( index2009 )
head( index2009 )
# combine index files for all years 2009-2020 where forms are available:
index <- build_index() # build_index( tax.years=2009:2020 )
# create index of 10 organizations from 2018
index.2018 <-
  index2018 %>%
  filter( FormType %in% c("990","990EZ") ) %>%
  sample_n( 10 )
# build all one-to-one tables for the sample
dir.create( "EFILE" )
setwd( "./EFILE" )
build_tables( url=index.2018$URL, year=2018 )
### TEST SPECIFIC TABLES
index <- tinyindex # random sample of 10,000 cases
# split index file into smaller chunks (for parallelization) and build tables
years <- 2017:2019
tables <- c( "F9-P00-T00-HEADER","F9-P01-T00-SUMMARY",
"F9-P08-T00-REVENUE","F9-P09-T00-EXPENSES",
"F9-P11-T00-ASSETS" )
# TABLE NAME: 'F9-P00-T00-HEADER'
# FUNCTION NAME: 'BUILD_F9_P00_T00_HEADER'
tables <- gsub( "-", "_", tables )
tables <- paste0( "BUILD_", tables )
for( i in years )
{
  dir.create( as.character(i) )    # create a folder for each year
  setwd( as.character(i) )
  index.i <- dplyr::filter( index, TaxYear == i )    # create index for one year
  # processing many small groups keeps memory usage low;
  # the parser builds temporary tables, then combines them at the end
  groups <- split_index( index.i, group.size = 100 )
  build_tables_parallel( groups=groups, year=i, table.names=tables )
  setwd( ".." )    # return to main directory
}
bind_data( years ) # compile all temp tables into one table
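To make the chunking step concrete, here is a minimal base-R sketch of splitting an index into groups of at most `group.size` rows. It is not the package's `split_index()` implementation; the helper name `split_into_groups` and the toy index are invented for illustration.

```r
# Sketch only: this is NOT the package's split_index() implementation.
# It shows one way to break a data frame into groups of at most
# `group.size` rows each.
split_into_groups <- function( df, group.size = 100 )
{
  # rows 1..group.size get id 1, the next block gets id 2, and so on
  group.id <- ceiling( seq_len( nrow(df) ) / group.size )
  split( df, group.id )
}

# made-up index with 250 rows
toy.index <- data.frame( id = 1:250 )
groups <- split_into_groups( toy.index, group.size = 100 )

print( length( groups ) )        # number of groups
print( sapply( groups, nrow ) )  # rows in each group
```

Smaller groups mean each worker holds fewer parsed returns in memory at once, at the cost of more temporary files to combine afterwards.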
Background
The IRS started processing electronic filings for nonprofit 990 tax forms in 2010 and releasing 990 efile returns via AWS in 2016. For more details on the history and current status of nonprofit efiling see this recent report.
All electronic tax returns have been released as XML documents currently stored in an AWS bucket (though soon migrating to the IRS website).
XML forms can be rendered using an efile viewer so that they look the same as a PDF of a regular 990 filing (you can see examples on ProPublica’s Nonprofit Explorer). They are NOT, however, in a convenient format for statistical analysis.
The irs990efile package was created to convert XML files into a relational database: normal rectangular data tables linked by a set of keys.
Documentation
All of the files share the following meta-fields: