This page provides documentation for the Master Concordance File, the “rosetta stone” that facilitates the conversion of IRS 990 E-Filer XML documents on AWS into structured databases.
The data dictionary below documents the xpath to variable mapping contained in the Master Concordance File. Click here for a DATA DICTIONARY describing unique variables on the 990 forms.
Please submit QUESTIONS AND ISSUES through GitHub.
Created by the Nonprofit Open Data Collective under the GPL-3.0 open source license for free use by all.
Many thanks to all of those who have helped generate this file, but especially to the Aspen Institute for hosting the initial “DATATHON” event which kicked us off, and to Miguel Barbosa at Citizen Audit for generating a large portion of the first draft of this file.
The MasterConcordanceFile.csv included in this repository consists of the following fields:
A more in-depth description of each variable is covered below.
The Master Concordance File is organized around xpaths, which are the ‘addresses’ that designate the location of data in XML documents. Since the IRS has released the e-filer data in XML format, the xpaths are needed to extract data related to specific variables from a file.
Each row of the Master Concordance File provides documentation for a unique xpath.
Each xpath provides the location for data from a specific field on the 990 form, for example the “total revenue” value a nonprofit enters. As the IRS has updated forms and schemas, xpaths related to a specific field have changed. If you want to collect data over time from the same field, you need to know all xpaths that represent that specific field.
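As a rough illustration of why every xpath for a field is needed, the sketch below tries a list of candidate xpaths against a return and keeps the first one that matches. The element names and the two toy XML documents are illustrative placeholders rather than entries copied from the concordance, and the helper `first_match` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative xpaths for one field under two schema versions
# (placeholders, not copied from the Master Concordance File).
TOTAL_REVENUE_XPATHS = [
    "ReturnData/IRS990/CYTotalRevenueAmt",
    "ReturnData/IRS990/TotalRevenueCurrentYear",
]


def first_match(root, xpaths):
    """Return the text of the first xpath that matches, or None if none match."""
    for path in xpaths:
        node = root.find(path)
        if node is not None:
            return node.text
    return None


# Two toy returns, one per schema version.
old_return = ET.fromstring(
    "<Return><ReturnData><IRS990>"
    "<TotalRevenueCurrentYear>1200</TotalRevenueCurrentYear>"
    "</IRS990></ReturnData></Return>"
)
new_return = ET.fromstring(
    "<Return><ReturnData><IRS990>"
    "<CYTotalRevenueAmt>3400</CYTotalRevenueAmt>"
    "</IRS990></ReturnData></Return>"
)

print(first_match(old_return, TOTAL_REVENUE_XPATHS))
print(first_match(new_return, TOTAL_REVENUE_XPATHS))
```

Real e-file documents may declare an XML namespace, in which case the paths would need namespace handling that this sketch leaves out.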
In addition, the same field may or may not be present on multiple forms. Large nonprofits fill out the full 990-PC form, which contains approximately 5,000 variables. Small nonprofits fill out the 990-EZ form, which contains approximately 1,800 variables. Of these, about 1,700 occur on both forms. For this reason, we have created a SCOPE code to describe whether a variable occurs on one or both forms.
The scope code also differentiates variables related to nonprofits (PC, EZ, and PZ codes) from those related to foundations (PF code).
So in short, the Master Concordance File provides documentation necessary to translate the IRS e-filer data into a structured database, partly by providing the map of xpaths onto fields, and mapping fields across forms onto common variables.
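A minimal sketch of the xpath-to-variable mapping, assuming the concordance is read as a CSV with `variable_name`, `scope`, `xpath`, and `version` columns. The sample rows, variable names, xpaths, and version strings below are made up for illustration, and `xpaths_by_variable` is a hypothetical helper.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for MasterConcordanceFile.csv; every row here is invented.
SAMPLE = """variable_name,scope,xpath,version
F9_01_PZ_EXAMPLE_REVENUE,PZ,/Return/ReturnData/IRS990/ExampleRevenueNew,v-new
F9_01_PZ_EXAMPLE_REVENUE,PZ,/Return/ReturnData/IRS990/ExampleRevenueOld,v-old
F9_01_PZ_EXAMPLE_REVENUE,PZ,/Return/ReturnData/IRS990EZ/ExampleRevenue,v-new
F9_01_PC_EXAMPLE_VOLUNTEERS,PC,/Return/ReturnData/IRS990/ExampleVolunteers,v-new
"""


def xpaths_by_variable(csv_file):
    """Group every xpath in the concordance under its variable name."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["variable_name"]].append(row["xpath"])
    return dict(groups)


groups = xpaths_by_variable(io.StringIO(SAMPLE))
for name, paths in groups.items():
    print(name, len(paths))
```

Inverting the same rows (xpath as key, variable name as value) gives the lookup needed when walking an XML document.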
In this example, revenue occurs as “Current Year Revenue” on the 990-PC version, and just “Revenue” on the 990-EZ version. The Number of Volunteers variable only occurs on the 990-PC version, so all EZ filers will be missing this data in the database.
The current Master Concordance File contains approximately 10,000 xpaths organized across the three different types of filers as follows:
Scope | Number of Variables |
---|---|
EZ | 108 |
HD | 48 |
PC | 3258 |
PF | 1518 |
PZ | 1831 |
To make sense of this table, note that all forms contain the same basic “Header” variables (scope=HD) which describe basic nonprofit and filing characteristics like name, address, EIN, tax year, etc.
Variables for 990-EZ filers include variables unique to the EZ form (scope=EZ), and variables common across the 990-EZ and 990-PC forms (scope=PZ).
So the 990-EZ filers will submit data on 48 + 108 + 1,831 = 1,987 unique fields.
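The same tally can be sketched for each filer type using the counts in the scope table above. The mapping of filer types to scopes follows the scope definitions in this document, and the names `SCOPE_COUNTS`, `FILER_SCOPES`, and `fields_for_filer` are hypothetical.

```python
# Variable counts per scope, copied from the scope table above.
SCOPE_COUNTS = {"EZ": 108, "HD": 48, "PC": 3258, "PF": 1518, "PZ": 1831}

# Scopes that apply to each filer type: every form carries the header (HD)
# variables, and PZ variables are shared by the full 990 and the 990-EZ.
FILER_SCOPES = {
    "990-EZ": ("HD", "EZ", "PZ"),
    "990-PC": ("HD", "PC", "PZ"),
    "990-PF": ("HD", "PF"),
}


def fields_for_filer(filer):
    """Sum the variable counts of every scope that applies to a filer type."""
    return sum(SCOPE_COUNTS[scope] for scope in FILER_SCOPES[filer])


for filer in FILER_SCOPES:
    print(filer, fields_for_filer(filer))
```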
The initial release of the Master Concordance File is meant to provide an architecture for documenting the IRS e-filer data, but not all fields are complete. This table reports on the progress of development of fields in the Master Concordance File:
FIELD | STATUS | VALIDATED? |
---|---|---|
variable_name | Complete | YES |
description | Complete | YES |
scope | Complete | NO |
location_code | Complete; Missing Line Info | NO |
form | Complete | YES |
part | Complete | NO |
data_type | In Progress | NO |
required | In Progress | NO |
cardinality | No Data Yet | NO |
rdb_table | No Data Yet | NO |
xpath | Complete | YES |
version | Complete | YES |
production_rule | No Data Yet | NO |
Each variable name begins with a 6-character prefix that follows the pattern: XX_XX_XX_NAME
Since there are over 6,500 unique variables, the prefix helps organize the variables into groups. The prefix indicates the FORM, LOCATION, and SCOPE of the variable.
FORM - Form from which the variable originates (main 990, or schedule)
* F9 - Variable occurs on Form 990, 990-EZ, or 990-PF
* SA, SB, … SR - Variable occurs on Schedule A to Schedule R
* AF - Auxiliary forms for PF foundation filers
LOCATION - Two-digit code indicating the PART of the 990 form, which indicates a thematic group of variables.
* 00 - Variable occurs outside of a section (“Part”) on the 990 form, typically the header or signature block
* 01 - Variable occurs in Part I of the form
* 02 - Variable occurs in Part II of the form
* Etc.
Header and signature block variables are consistent across all PC, EZ, and PF forms and have a scope code “HD”. Since the signature block is in a different location on each form, variables with an HD scope have a location of 00 for consistency across the forms, even though in two cases they are assigned their own part.
SCOPE - Indicates which filers would submit data related to the variable (PC, EZ, PZ=PC&EZ, or PF)
* PC - Variable relevant ONLY to full form 990 nonprofit filers
* EZ - Variable relevant ONLY to 990-EZ nonprofit filers
* PZ - Variable relevant to BOTH full 990 and 990-EZ nonprofit filers
* PF - Variable relevant only to foundations
* HD - Header and signature block variables that are identical across PC, EZ, and PF forms
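The prefix convention described above can be split apart with a regular expression. This is a sketch under the assumption that every name follows XX_XX_XX_NAME exactly; the example variable name and the `parse_variable_name` helper are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class VariablePrefix(NamedTuple):
    form: str      # F9, AF, or SA..SR
    location: str  # two-digit part code, 00 for header/signature block
    scope: str     # PC, EZ, PZ, PF, or HD
    name: str      # remainder of the variable name


PATTERN = re.compile(
    r"^(?P<form>F9|AF|S[A-R])_"
    r"(?P<location>\d{2})_"
    r"(?P<scope>PC|EZ|PZ|PF|HD)_"
    r"(?P<name>.+)$"
)


def parse_variable_name(variable_name: str) -> Optional[VariablePrefix]:
    """Split a variable name into form, location, scope, and name, or return None."""
    match = PATTERN.match(variable_name)
    if match is None:
        return None
    return VariablePrefix(**match.groupdict())


# Hypothetical variable name used only to exercise the pattern.
print(parse_variable_name("F9_01_PC_EXAMPLE_NAME"))
```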
Each variable has been assigned a unique name. One variable might appear several times in the concordance file in circumstances where multiple xpaths map onto the same variable. This happens when xpaths have changed over different versions of the IRS e-filer schema, or when the same variable occurs on both the 990-PC and 990-EZ forms.
Definition of the variable based upon information on the 990 forms.
Nonprofits can e-file two versions of the 990 form. Typically large nonprofits file the full 990 form (which we are calling the 990-PC form). Small nonprofits file the 990-EZ form.
The scope codes PC, EZ, and PZ describe the population that a variable covers. The PZ code has the largest scope, since this set of variables occurs on both the full 990 and the 990-EZ forms. The PC code refers to variables that are ONLY on the full 990 form. The EZ code refers to variables ONLY on the 990-EZ form.
All foundations file the same form. All of the variables from that form are designated by the PF (private foundation) scope code.
If we want to conduct a study on the population of active nonprofits then we need to rely on the PZ subset of variables.
Some schedules are only relevant for nonprofits filing the full 990-PC form. The only schedule that foundations file is Schedule B. Thus, variables on the schedules have the following scope:
Location codes indicate the location of the field on the paper version of the 990 form for ease of look-up.
The location code is meant to be hierarchical and is organized approximately as:
FORM -> PART -> LINE -> SUBLINE or COLUMN
Note that location of fields on 990 forms may change over time as forms are revised. We have defined location codes based upon the IRS 2016 versions of forms and schedules.
All location codes begin with either F990-PC, F990-EZ, F990-PF, or SCHED-A through SCHED-R.
Some foundations are required to submit auxiliary schedules. These are indicated by AUX-SCHED after the location code.
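A small sketch of how the leading portion of a location code could be checked against the prefixes listed above. Only the prefixes come from this document; the text after the prefix in the example code is a made-up placeholder, since the exact layout of the remaining segments is not specified here, and the helper names are hypothetical.

```python
import string

FORM_PREFIXES = ("F990-PC", "F990-EZ", "F990-PF")
# SCHED-A through SCHED-R (letters with no matching schedule are harmless here).
SCHEDULE_PREFIXES = tuple(
    f"SCHED-{letter}" for letter in string.ascii_uppercase[:18]
)


def location_prefix(location_code):
    """Return the form or schedule prefix of a location code, or None."""
    for prefix in FORM_PREFIXES + SCHEDULE_PREFIXES:
        if location_code.startswith(prefix):
            return prefix
    return None


def is_auxiliary(location_code):
    """True when the code carries the AUX-SCHED marker used for foundations."""
    return "AUX-SCHED" in location_code


# Placeholder location code; the segments after the prefix are invented.
print(location_prefix("F990-PC-PART-01-LINE-12"))
```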
A 2-character code representing the form or schedule from which the variable is derived.
FORM_CODE | FORM_NAME | NUMBER_OF_ASSOCIATED_VARIABLES | NUMBER_OF_ASSOCIATED_XPATHS |
---|---|---|---|
F9 | F990 | 2741 | 4257 |
AF | PF-AUX-SCHED | 528 | 707 |
SA | SCHED-A | 717 | 977 |
SB | SCHED-B | 102 | 130 |
SC | SCHED-C | 177 | 321 |
SD | SCHED-D | 311 | 436 |
SE | SCHED-E | 64 | 70 |
SF | SCHED-F | 92 | 116 |
SG | SCHED-G | 301 | 482 |
SH | SCHED-H | 669 | 863 |
SI | SCHED-I | 64 | 102 |
SJ | SCHED-J | 84 | 121 |
SK | SCHED-K | 144 | 151 |
SL | SCHED-L | 82 | 108 |
SM | SCHED-M | 197 | 315 |
SN | SCHED-N | 156 | 220 |
SO | SCHED-O | 6 | 6 |
SR | SCHED-R | 328 | 417 |
Reports the location of the field on the 990 forms for ease of look-up and to organize variables by groups.
Note that PZ fields occur on both the 990-PC and 990-EZ forms. The part code references the location on the full 990-PC form, which will rarely be in the same place on the 990-EZ form. If you need the location of PZ variables on the EZ form, use the location codes instead.
Field data type, derived from IRS XSD schema files.
Data_Type | Frequency |
---|---|
(blank) | 4428 |
USAmountType | 1638 |
BooleanType | 764 |
USAmountNNType | 575 |
CheckboxType | 530 |
StreetAddressType | 233 |
IntegerNNType | 215 |
LineExplanationType | 209 |
TextType | 172 |
BusinessNameLine1Type | 140 |
BusinessNameLine2Type | 119 |
StateType | 96 |
RatioType | 88 |
CityType | 75 |
CountryType | 75 |
ExplanationType | 73 |
ZIPCodeType | 73 |
ShortExplanationType | 52 |
PersonNameType | 50 |
StringType | 49 |
EINType | 27 |
LargeRatioType | 17 |
CountType | 16 |
DateType | 15 |
ShortDescriptionType | 13 |
IntegerType | 10 |
PhoneNumberType | 10 |
Count2Type | 8 |
YearType | 6 |
xsd:decimal | 5 |
DecimalNNType | 4 |
BusinessNameControlType | 2 |
CUSIPNumberType | 2 |
PersonTitleType | 2 |
PTINType | 2 |
SSNType | 2 |
TimestampType | 2 |
AlphaNumericType | 1 |
InCareOfNameType | 1 |
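
The data types above could drive type conversion when values are loaded into a database. The sketch below is one possible mapping, not part of the concordance: it assumes amounts and counts arrive as integer strings, that BooleanType uses the usual XML boolean spellings (`true`/`false`/`1`/`0`), and that a checked CheckboxType is marked with `X`. Any type not listed falls back to plain text.

```python
from decimal import Decimal


def _to_bool(text):
    # Assumption: XML boolean lexical forms.
    return text.strip().lower() in ("true", "1")


def _to_checkbox(text):
    # Assumption: a checked box is recorded as "X".
    return text.strip().upper() == "X"


CASTERS = {
    "USAmountType": int,
    "USAmountNNType": int,
    "IntegerType": int,
    "IntegerNNType": int,
    "CountType": int,
    "RatioType": Decimal,
    "LargeRatioType": Decimal,
    "DecimalNNType": Decimal,
    "xsd:decimal": Decimal,
    "BooleanType": _to_bool,
    "CheckboxType": _to_checkbox,
}


def cast_value(data_type, text):
    """Convert raw XML text using the concordance data_type; default to str."""
    caster = CASTERS.get(data_type, str)
    return caster(text)


print(cast_value("USAmountType", "1200"))
print(cast_value("BooleanType", "true"))
print(cast_value("TextType", "Example text"))
```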
Indicates whether the field must be completed by the particular filer in order to submit 990 data to the IRS.
The e-filing system provides some validation to ensure necessary fields are complete. It may not be strictly enforced, though.
Describes the variable's relationship to the filing nonprofit: whether a value occurs once or many times on a return.
One-to-one examples include nonprofit name and nonprofit EIN - values that are unique and occur once on the form.
One-to-many examples include grants given, names of board members, and program accomplishments - anything that might occur multiple times on a 990 return.
This field (not yet implemented) provides a relational database structure for the data. Many fields in the dataset are one-to-one, meaning for each nonprofit there will be a unique value. These fields appear on the main table for the form.
Some fields have a one-to-many relationship: there are many board members for each nonprofit, several program activities reported, etc. In these cases, each new table defines a set of fields that function together, for example a compensation table or a nonprofit activities table.
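The one-to-one versus one-to-many split can be sketched with an in-memory SQLite database. The table names, column names, and the choice of EIN plus tax year as the join key are assumptions made for this example; the `rdb_table` field is not yet implemented, so none of this reflects an actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One row per return: one-to-one fields live here.
    CREATE TABLE main_table (
        ein TEXT, tax_year INTEGER, org_name TEXT,
        PRIMARY KEY (ein, tax_year)
    );
    -- Many rows per return: one-to-many fields get their own table.
    CREATE TABLE board_members (
        ein TEXT, tax_year INTEGER, person_name TEXT
    );
    """
)

# Placeholder values, not real filer data.
conn.execute(
    "INSERT INTO main_table VALUES (?, ?, ?)", ("000000000", 2016, "Example Org")
)
conn.executemany(
    "INSERT INTO board_members VALUES (?, ?, ?)",
    [("000000000", 2016, "Person A"), ("000000000", 2016, "Person B")],
)

# Join the child table back to the main table on the shared header keys.
rows = conn.execute(
    """
    SELECT m.org_name, b.person_name
    FROM main_table AS m
    JOIN board_members AS b
      ON b.ein = m.ein AND b.tax_year = m.tax_year
    ORDER BY b.person_name
    """
).fetchall()
print(rows)
```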
The names of tables will follow this convention:
* HEADER - Table of unique (OTO) header data for each nonprofit / foundation
* PC-OTO-TABLE_NAME - One-to-one table from the 990-PC form
* SA-OTM-TABLE_NAME - One-to-many table from Schedule A
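The naming convention can be expressed as a pair of small helpers. This is a sketch only: the function names are hypothetical, the example table names are placeholders, and treating HEADER as a special case with no source prefix is an assumption drawn from the list above.

```python
VALID_CARDINALITIES = ("OTO", "OTM")  # one-to-one, one-to-many


def build_table_name(source, cardinality, name):
    """Compose SOURCE-CARDINALITY-NAME, e.g. a one-to-many Schedule A table."""
    if cardinality not in VALID_CARDINALITIES:
        raise ValueError(f"unknown cardinality code: {cardinality!r}")
    return f"{source}-{cardinality}-{name}"


def split_table_name(table_name):
    """Return (source, cardinality, name); HEADER has no source prefix."""
    if table_name == "HEADER":
        return (None, "OTO", "HEADER")
    source, cardinality, name = table_name.split("-", 2)
    return (source, cardinality, name)


print(build_table_name("SA", "OTM", "EXAMPLE_TABLE"))
print(split_table_name("PC-OTO-EXAMPLE_TABLE"))
```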
Every table should include the following header variables to ensure they can be joined properly (these variables are designated by the table name “HEADER”):
In the future consider adding the variables to tables for ease of subsetting:
The “address” of the variable