This page provides documentation for the Master Concordance File, the “rosetta stone” that facilitates the conversion of IRS 990 E-Filer XML documents on AWS into structured databases.

The data dictionary below documents the xpath to variable mapping contained in the Master Concordance File. Click here for a DATA DICTIONARY describing unique variables on the 990 forms.

Please submit QUESTIONS AND ISSUES through GitHub.

Acknowledgements

Created by the Nonprofit Open Data Collective under the GPL-3.0 open source license for free use by all.

Many thanks to all who have helped generate this file, and especially to the Aspen Institute for hosting the initial “DATATHON” event that kicked us off, and to Miguel Barbosa at Citizen Audit for generating a large portion of the first draft of this file.

Concordance Fields

The MasterConcordanceFile.csv included in this repository consists of the following fields:

A more in-depth description of each variable is covered below.

CONCORDANCE OVERVIEW

The Master Concordance File is organized around xpaths, which are the ‘addresses’ that designate the location of data in XML documents. Since the IRS has released the e-filer data in XML format, the xpaths are needed to extract data related to specific variables from a file.

Each row of the Master Concordance File provides documentation for a unique xpath.

Each xpath provides the location for data from a specific field on the 990 form, for example the “total revenue” value a nonprofit enters. As the IRS has updated forms and schemas, xpaths related to a specific field have changed. If you want to collect data over time from the same field, you need to know all xpaths that represent that specific field.
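The idea can be sketched in Python with only the standard library. This is a minimal illustration, not the project's extraction code: the variable name, the two xpaths, and the toy XML documents are assumptions made up for the example, and real e-filer documents declare an XML namespace that is omitted here.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

# Toy concordance rows: the variable name and xpaths are illustrative assumptions.
CONCORDANCE_CSV = """variable_name,xpath
F9_01_PC_TOTALREVENUE,/Return/ReturnData/IRS990/TotalRevenueCurrentYear
F9_01_PC_TOTALREVENUE,/Return/ReturnData/IRS990/CYTotalRevenueAmt
"""

# Two toy returns that store the same field under different element names.
OLD_RETURN = (
    "<Return><ReturnData><IRS990>"
    "<TotalRevenueCurrentYear>100</TotalRevenueCurrentYear>"
    "</IRS990></ReturnData></Return>"
)
NEW_RETURN = (
    "<Return><ReturnData><IRS990>"
    "<CYTotalRevenueAmt>250</CYTotalRevenueAmt>"
    "</IRS990></ReturnData></Return>"
)


def load_xpaths(csv_text):
    """Group every xpath in the concordance under its variable name."""
    xpaths = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        xpaths[row["variable_name"]].append(row["xpath"])
    return xpaths


def extract(xml_text, candidate_xpaths):
    """Return the text at the first xpath that matches, or None."""
    root = ET.fromstring(xml_text)
    for xpath in candidate_xpaths:
        # ElementTree paths are relative to the root element, so drop "/Return/".
        relative = xpath.split("/", 2)[2]
        node = root.find(relative)
        if node is not None:
            return node.text
    return None


xpaths = load_xpaths(CONCORDANCE_CSV)
old_value = extract(OLD_RETURN, xpaths["F9_01_PC_TOTALREVENUE"])
new_value = extract(NEW_RETURN, xpaths["F9_01_PC_TOTALREVENUE"])
print(old_value, new_value)
```

Trying every known xpath for a variable is what lets the same field be collected from returns filed under different schema versions.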

In addition, the same field may or may not be present on multiple forms. Large nonprofits fill out the full 990-PC form, which contains approximately 5,000 variables. Small nonprofits fill out the 990-EZ form, which contains approximately 1,800 variables. Of these, about 1,700 occur on both forms. For this reason, we have created a SCOPE code to describe whether a variable occurs on one or both forms.

The scope code also differentiates variables related to nonprofits (PC, EZ and PZ codes) from those related to foundations (PF code).

So in short, the Master Concordance File provides documentation necessary to translate the IRS e-filer data into a structured database, partly by providing the map of xpaths onto fields, and mapping fields across forms onto common variables.

In this example, revenue occurs as “Current Year Revenue” on the 990-PC version, and just “Revenue” on the 990-EZ version. The Number of Volunteers variable only occurs on the 990-PC version, so all EZ filers will be missing this data in the database.

The current Master Concordance File contains approximately 10,000 xpaths organized across the three different types of filers as follows:

Scope Number of Variables
EZ 108
HD 48
PC 3258
PF 1518
PZ 1831

To make sense of this table, note that all forms contain the same basic “Header” variables (scope=HD) which describe basic nonprofit and filing characteristics like name, address, EIN, tax year, etc.

Variables for 990-EZ filers include variables unique to the EZ form (scope=EZ), and variables common across the 990-EZ and 990-PC forms (scope=PZ).

So the 990-EZ filers will submit data on 48 + 108 + 1,711 = 1,867 unique fields.

Development Status

The initial release of the Master Concordance File is meant to provide an architecture for documenting the IRS e-filer data, but not all fields are complete. This table reports on the progress of development of fields in the Master Concordance File:

FIELD STATUS VALIDATED?
variable_name Complete YES
description Complete YES
scope Complete NO
location_code Complete; Missing Line Info NO
form Complete YES
part Complete NO
data_type In Progress NO
required In Progress NO
cardinality No Data Yet NO
rdb_table No Data Yet NO
xpath Complete YES
version Complete YES
production_rule No Data Yet NO

DATA DICTIONARY

variable_name

Each variable name begins with a 6-letter prefix that follows the pattern: XX_XX_XX_NAME

Since there are over 6,500 unique variables, the prefix helps organize the variables into groups. The prefix indicates the FORM, LOCATION, and SCOPE of the variable.

FORM - Form from which the variable originates (main 990, or schedule)

  • F9 - Variable occurs on Form 990, 990-EZ, or 990-PF
  • SA, SB, … SR - Variable occurs on Schedule A to Schedule R
  • AF - Auxiliary forms for PF foundation filers

LOCATION - Two-digit code indicating the PART of the 990 form, which indicates a thematic group of variables.

  • 00 - Variable occurs outside of a section (“Part”) on the 990 form, typically the header or signature block
  • 01 - Variable occurs in Part I of the form
  • 02 - Variable occurs in Part II of the form
  • Etc.

Header and signature block variables are consistent across all PC, EZ, and PF forms and have a scope code of “HD”. Since the signature block is in a different location on each form, variables with an HD scope have a location of 00 for consistency across the forms, even though in two cases they are assigned their own part.

SCOPE - Indicates which filers would submit data related to the variable (PC, EZ, PZ=PC&EZ, or PF)

  • PC - Variable relevant ONLY to full form 990 nonprofit filers
  • EZ - Variable relevant ONLY to 990-EZ nonprofit filers
  • PZ - Variable relevant to BOTH full 990 and 990-EZ nonprofit filers
  • PF - Variable relevant only to foundations
  • HD - Header and signature block variables that are identical across PC, EZ, and PF forms

Each variable has been assigned a unique name. One variable might appear several times in the concordance file in circumstances where multiple xpaths map onto the same variable. This happens when xpaths have changed over different versions of the IRS e-filer schema, or when the same variable occurs on both the 990-PC and 990-EZ forms.
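Because the prefix is positional, it can be split mechanically. The sketch below assumes the FORM, LOCATION, SCOPE order given above; the example variable name is hypothetical and only serves to exercise the parser.

```python
from typing import NamedTuple


class VariableName(NamedTuple):
    form: str      # e.g. "F9", "SA", "AF"
    location: str  # two-digit part code, "00" for header / signature block
    scope: str     # "PC", "EZ", "PZ", "PF" or "HD"
    name: str      # remainder of the variable name


VALID_SCOPES = {"PC", "EZ", "PZ", "PF", "HD"}


def parse_variable_name(variable_name):
    """Split XX_XX_XX_NAME into its form, location, scope and name parts."""
    # Split on the first three underscores only; NAME may contain more.
    parts = variable_name.split("_", 3)
    if len(parts) != 4:
        raise ValueError(f"expected XX_XX_XX_NAME, got {variable_name!r}")
    form, location, scope, name = parts
    if scope not in VALID_SCOPES:
        raise ValueError(f"unknown scope code {scope!r}")
    return VariableName(form, location, scope, name)


# Hypothetical variable name used only to exercise the parser.
parsed = parse_variable_name("F9_01_PC_TOTAL_REVENUE")
print(parsed)
```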

description

Definition of the variable based upon information on the 990 forms.

scope

Nonprofits can e-file two versions of the 990 form. Typically large nonprofits file the full 990 form (which we are calling the 990-PC form). Small nonprofits file the 990-EZ form.

The scope codes PC, EZ, and PZ describe the population that a variable covers. The PZ code has the largest scope, since this set of variables occurs on both the full 990 and the 990-EZ forms. The PC code refers to variables that are ONLY on the full 990 form. The EZ code refers to variables ONLY on the 990-EZ form.

All foundations file the same form. All of the variables from that form are designated by the PF (private foundation) scope code.

If we want to conduct a study on the population of active nonprofits then we need to rely on the PZ subset of variables.

  • HD - Header and signature block variables relevant to ALL filers (PC, EZ, and PF)
  • PC - Variable present ONLY on the full 990-PC form for public charities / nonprofits
  • EZ - Variable present ONLY on the 990-EZ form for charities / nonprofits
  • PZ - Variable present on BOTH the PC and EZ versions for charities / nonprofits
  • PF - Variable present on form 990-PF for foundations
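One way to read these codes is as a lookup from filer type to the scopes that filer submits. The sketch below is an illustration under that reading; the row layout and the variable names are hypothetical, not taken from the concordance file.

```python
# Scope codes submitted by each type of filer, following the definitions above.
FILER_SCOPES = {
    "990": {"HD", "PC", "PZ"},     # full form, called 990-PC in this document
    "990-EZ": {"HD", "EZ", "PZ"},
    "990-PF": {"HD", "PF"},
}


def variables_for_filer(rows, form_type):
    """Return the sorted, de-duplicated variable names a given filer can report."""
    scopes = FILER_SCOPES[form_type]
    return sorted({row["variable_name"] for row in rows if row["scope"] in scopes})


# Hypothetical concordance rows, reduced to the two fields used here.
rows = [
    {"variable_name": "F9_00_HD_EIN", "scope": "HD"},
    {"variable_name": "F9_01_PC_VOLUNTEERS", "scope": "PC"},
    {"variable_name": "F9_01_PZ_REVENUE", "scope": "PZ"},
    {"variable_name": "F9_01_PZ_REVENUE", "scope": "PZ"},  # second xpath, same variable
    {"variable_name": "F9_01_PF_ASSETS", "scope": "PF"},
]

ez_variables = variables_for_filer(rows, "990-EZ")
print(ez_variables)
```

A study covering both full-form and EZ filers would restrict itself to the intersection of their scopes, which is HD and PZ.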

Some schedules are only relevant for nonprofits filing the full 990-PC form. The only schedule that foundations file is Schedule B. Thus, variables on the schedules have the following scope:

  • Schedule A: PC AND EZ
  • Schedule B: PC AND EZ AND PF
  • Schedule C: PC AND EZ
  • Schedule D: PC
  • Schedule E: PC AND EZ
  • Schedule F: PC
  • Schedule G: PC AND EZ
  • Schedule H: PC
  • Schedule I: PC
  • Schedule J: PC
  • Schedule K: PC
  • Schedule L: PC AND EZ
  • Schedule M: PC
  • Schedule N: PC AND EZ
  • Schedule O: PC AND EZ
  • Schedule R: PC

location_code

Location codes indicate the location of the field on the paper version of the 990 form for ease of look-up.

The location code is meant to be hierarchical and is approximately organized as:

FORM -> PART -> LINE -> SUBLINE or COLUMN

Note that location of fields on 990 forms may change over time as forms are revised. We have defined location codes based upon the IRS 2016 versions of forms and schedules.

All location codes begin with either F990-PC, F990-EZ, F990-PF, or SCHED-A through SCHED-R.

Some foundations are required to submit auxiliary schedules. These are indicated by AUX-SCHED after the location code.

form

A 2-character code representing the form or schedule from which the variable is derived.

FORM_CODE FORM_NAME NUMBER_OF_ASSOCIATED_VARIABLES NUMBER_OF_ASSOCIATED_XPATHS
F9 F990 2741 4257
AF PF-AUX-SCHED 528 707
SA SCHED-A 717 977
SB SCHED-B 102 130
SC SCHED-C 177 321
SD SCHED-D 311 436
SE SCHED-E 64 70
SF SCHED-F 92 116
SG SCHED-G 301 482
SH SCHED-H 669 863
SI SCHED-I 64 102
SJ SCHED-J 84 121
SK SCHED-K 144 151
SL SCHED-L 82 108
SM SCHED-M 197 315
SN SCHED-N 156 220
SO SCHED-O 6 6
SR SCHED-R 328 417

part

Reports the location of the field on the 990 forms for ease of look-up and to organize variables by groups.

Note that PZ fields occur on both the 990-PC and 990-EZ forms. The part code references the location on the full 990-PC form, which will rarely be in the same place on the 990-EZ form. If you need the location of PZ variables on the EZ form use the location codes instead.

data_type

Field data type, derived from IRS XSD schema files.

Data_Type Frequency
(blank) 4428
USAmountType 1638
BooleanType 764
USAmountNNType 575
CheckboxType 530
StreetAddressType 233
IntegerNNType 215
LineExplanationType 209
TextType 172
BusinessNameLine1Type 140
BusinessNameLine2Type 119
StateType 96
RatioType 88
CityType 75
CountryType 75
ExplanationType 73
ZIPCodeType 73
ShortExplanationType 52
PersonNameType 50
StringType 49
EINType 27
LargeRatioType 17
CountType 16
DateType 15
ShortDescriptionType 13
IntegerType 10
PhoneNumberType 10
Count2Type 8
YearType 6
xsd:decimal 5
DecimalNNType 4
BusinessNameControlType 2
CUSIPNumberType 2
PersonTitleType 2
PTINType 2
SSNType 2
TimestampType 2
AlphaNumericType 1
InCareOfNameType 1

required

Indicates whether the field is required for a particular filer in order to submit 990 data to the IRS.

The e-filing system provides some validation to ensure necessary fields are complete. It may not be strictly enforced, though.

cardinality

Definition of the variable's relationship to the filing nonprofit.

  • ONE-TO-ONE - Each of these variables occurs only one time on the 990 return

Examples include nonprofit name and nonprofit EIN - values that are unique and occur once on the form.

  • ONE-TO-MANY - Variables that can occur many times on a return

Examples include grants given, names of board members, and program accomplishments - anything that might occur multiple times on a 990 return.
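The distinction matters at extraction time. A small sketch, assuming a toy XML document whose element names are invented for illustration and are not real IRS xpaths:

```python
import xml.etree.ElementTree as ET

# Toy return; element names are illustrative assumptions, not real IRS xpaths.
RETURN_XML = """
<Return>
  <Filer><Name>Example Nonprofit</Name></Filer>
  <Officer><PersonName>Ann Lee</PersonName></Officer>
  <Officer><PersonName>Raj Patel</PersonName></Officer>
</Return>
"""

root = ET.fromstring(RETURN_XML)

# ONE-TO-ONE: a single value per return, so find() is enough.
org_name = root.find("Filer/Name").text

# ONE-TO-MANY: collect every occurrence with findall().
officers = [node.text for node in root.findall("Officer/PersonName")]

print(org_name, officers)
```

One-to-one values fit in a single row per return, while one-to-many values need a separate table with one row per occurrence.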

rdb_table

This field (not yet implemented) provides a relational database structure for the data. Many fields in the dataset are one-to-one, meaning for each nonprofit there will be a unique value. These fields appear on the main table for the form.

Some fields have a one-to-many relationship. There are many board members for each nonprofit, there are several program activities reported, etc. In these cases, each new table defines a set of fields that function together. A compensation table. A nonprofit activities table. Etc.

The names of tables will follow this convention:

  • HEADER - Table of unique (OTO) header data for each nonprofit / foundation
  • PC-OTO-TABLE_NAME - One-to-one table from the 990-PC form
  • SA-OTM-TABLE_NAME - One-to-many table from Schedule A

  • PC / EZ / PZ / PF = scope codes (if scope = HD then table name = HEADER)
  • OTO = one-to-one relationships only
  • OTM = one-to-many relationships
  • TABLE_NAME - informative name describing group of variables (for example, compensation, expenses, etc.)
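Since this field is not yet implemented, the following is only a sketch of how the naming convention could be applied. The function name, its arguments, and the example table names are hypothetical.

```python
# Abbreviations for the two cardinality values used in table names.
CARDINALITY_CODES = {"ONE-TO-ONE": "OTO", "ONE-TO-MANY": "OTM"}


def build_table_name(prefix, cardinality, table_name):
    """Build a table name such as PC-OTO-TABLE_NAME or SA-OTM-TABLE_NAME."""
    # Header variables always live in the shared HEADER table.
    if prefix == "HD":
        return "HEADER"
    return f"{prefix}-{CARDINALITY_CODES[cardinality]}-{table_name.upper()}"


# Hypothetical table names used only to show the pattern.
compensation_table = build_table_name("PC", "ONE-TO-MANY", "compensation")
header_table = build_table_name("HD", "ONE-TO-ONE", "anything")
print(compensation_table, header_table)
```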

Every table should include the following header variables to ensure they can be joined properly (these variables are designated by the table name “HEADER”):

  • Nonprofit name
  • Nonprofit EIN
  • Form
  • Tax year
  • Submission version / timestamp
  • Document Locator Number for the return

In the future, consider adding the following variables to tables for ease of subsetting:

  • NTEE codes (or other system)
  • Organizational type (charity, community foundation, supporting nonprofit, etc.)
  • Geography (state, MSA, city, zip, county)
  • Size (5 ranges)

xpath

The “address” of the variable on the XML forms released on Amazon.

For an overview of using xpaths in R, see the Quick Guide to XML in R.

version

The IRS has revised the 990 and updated e-filing forms many times. This field contains the XSD version information necessary to match schemas to XML documents.
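A rough sketch of version matching follows. It assumes that the root element of a return carries a returnVersion attribute and that the version field lists one or more version strings separated by semicolons. Both the separator and the sample rows are assumptions for illustration, so check them against the actual file.

```python
import xml.etree.ElementTree as ET

# Toy return; only the attribute on the root element matters here.
RETURN_XML = '<Return returnVersion="2015v2.1"><ReturnData/></Return>'

# Toy concordance rows; the version strings and ";" separator are assumptions.
ROWS = [
    {"xpath": "/Return/ReturnData/IRS990/TotalRevenueCurrentYear",
     "version": "2009v1.0;2010v3.2"},
    {"xpath": "/Return/ReturnData/IRS990/CYTotalRevenueAmt",
     "version": "2014v5.0;2015v2.1"},
]


def xpaths_for_version(rows, return_version):
    """Keep only the xpaths documented for the given schema version."""
    return [
        row["xpath"]
        for row in rows
        if return_version in row["version"].split(";")
    ]


return_version = ET.fromstring(RETURN_XML).get("returnVersion")
matching = xpaths_for_version(ROWS, return_version)
print(return_version, matching)
```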

production_rule

Xpaths contained in the Master Concordance File are used to extract data from XML files and convert it into a more structured format. Production rules represent anything that should be done to the raw data after it is extracted from XML files to standardize the data or make it more useful.

For example, in some versions of forms a checked checkbox will be represented as an “X”, and in others as a “Y” or “1”. These all mean the same thing, so they should be re-coded with the same value. This is a suggested production rule.
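The checkbox rule could be written as a small function. This is a sketch of one possible rule, not an implemented one: treating every other value, including a missing element, as unchecked and stripping surrounding whitespace are added assumptions.

```python
# Raw markers that all mean "checked", as described above.
CHECKED_VALUES = {"X", "Y", "1"}


def recode_checkbox(raw):
    """Map any checked marker to True; anything else (including None) to False."""
    if raw is None:
        return False
    # Assumption: surrounding whitespace is not meaningful.
    return raw.strip() in CHECKED_VALUES


recoded = [recode_checkbox(value) for value in ["X", "Y", "1", "", None]]
print(recoded)
```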

last_version_modified

This field contains a record of when specific variables have been last updated.