Data

We have both CSV and RDS (R data set) files available in the DATA section of this GitHub repository. They can be loaded directly into R as follows.

Read CSV version:

dat <- read.csv( "https://github.com/Nonprofit-Open-Data-Collective/machine_learning_mission_codes/blob/master/DATA/MISSION.csv?raw=true", stringsAsFactors=F )

Read RDS version:

dat <- readRDS( gzcon( url( "https://github.com/Nonprofit-Open-Data-Collective/machine_learning_mission_codes/blob/master/DATA/MISSION.rds?raw=true" )))

Overview of the Training Dataset

[need to add]…How was sample created, what it represents, why a test sample is useful for benchmarking and replication.

Garbage in garbage out discussion:

quality of program / mission descriptions
quality of activity codes

See the Taxonomy section for activity codes.

IRS versus human coding…(validity and reliability of taxonomies)

Why Use a Common Replication Dataset?

The goal of this project is to create a training dataset that can serve as a reference point for performance of program activity classification algorithms that rely on the types of text that would be readily available on websites, grant aplications, tax forms, or annual reports.

The creation of a reference dataset allows for innovation and progress since the relative performance of algorithms can be compared when they are applied to the same dataset. Performance metrics are difficult to interpret if they are drawn from different underlying data sources.

The field of social network analysis provides some examples of this approach by benchmarking the performance of clustering algorithms using a small set of canonical datasets.

Agarwal, G., & Kempe, D. (2008). Modularity-maximizing graph communities via mathematical programming. The European Physical Journal B, 66(3), 409-418. PAPER

Raw Data Sources

We have built a training dataset using data from two primary sources:

The IRS E-File database contains machine-readable text fields on nonprofit names, mission statements, and program service accomplishments.

The IRS 1023-EZ files contain mission taxonomy codes for the traditional National Taxonomy of Exempt Entities (NTEE), as well as eight binary mission codes related to nonprofit purpose such as religious activities, scientific activities, recreational activities, or welfare activities.

See the taxonomy section of this site for more information.

Available Mission and Activity Text

Text-based data describing nonprofit activities.

Nonprofit name: Form 990 and 990-EZ, header
Nonprofit missions on IRS forms: Form 990, Part I, Line 1; Form 990-EZ, Part III, Line 0
Program service accomplishments: Form 990, Part III; Form 990-EZ, Part III

Raw Mission Data

The nonprofit mission data comes from the new IRS e-file data available on AWS as XML files.

library( xmltools )
library( purrr )
library( xml2 )
library( dplyr )

# source build functions
source( "https://raw.githubusercontent.com/Nonprofit-Open-Data-Collective/irs-990-efiler-database/master/BUILD_SCRIPTS/build_efile_database_functions.R" )

dat <- buildIndex()
table( dat$FormType, dat$TaxYear )

	2009	2010	2011	2012	2013	2014	2015	2016	2017
990	33,360	123,107	159,539	179,674	198,738	218,614	232,975	214,585	25,921
990EZ	15,500	63,253	82,066	93,769	104,538	116,461	124,507	121,530	28,767
990PF	2,352	25,275	34,597	39,936	45,897	53,443	58,724	60,305	20,608

XML Tools in R

If you want to work with the data directly you will need to use some XML tools.

Quick Guide to Working with XML in R

Build Custom Databases

You can build custom datasets from the IRS XML fields. Some sapmle scripts are available here:

Nonprofit Open Data Collective

And many of the tables in CSV formats are available on our Data World group: https://data.world/activity/npdata