In this first script we will compile all raw data and create three separate files:

  1. NONPROFIT-ADDRESSES-2014-2019.rds: Data file with complete nonprofit data
  2. NONPROFIT-2014-2019.rds: Nonprofit dataset with addressses and other essential information
  3. PEOPLE-2014-2019.rds: Board member dataset with addresses and other essential information

Files 2 and 3 will then be enhanced with geolocation data.

1. Downloading raw data and creating a comprenhensive dataset file

Data files have been previously manually downloaded from the following address under the heading “Files by Year”:

https://www.irs.gov/charities-non-profits/exempt-organizations-form-1023ez-approvals

Standard download format is .xlsx and the data is provided by year.

1.2. Binding all rds files into a a single comprenhensive rds

Before we bind, we need to adjust some variables. All variables remain consistent until 2018, where some additional variables are added.

## [1] "Gamingactyno...91"   "Gamingactyyes...92"  "Mission"            
## [4] "Gamingactyno...103"  "Gamingactyyes...104" "HospitalOrChurchNo" 
## [7] "HospitalOrChurchYes" "EZVersionNumber"
## [1] "Gamingactyno"  "Gamingactyyes"

2018 datafile has duplicated “Gamingactyno” and “Gamingactyyes” columns, note that the excel inport process has added the column number to the name to avoid duplicate var names. In addition, we have four new variables:

  • “Mission”
  • “HospitalOrChurchNo”
  • “HospitalOrChurchYes”
  • “EZVersionNumber”

“Gamingactyno” and “Gamingactyyes” in 2018 datafile both have identical data.

## [1] TRUE
## [1] TRUE

2019 has the same additional variables as 2018

## [1] "Gamingactyno...91"   "Gamingactyyes...92"  "Mission"            
## [4] "Gamingactyno...103"  "Gamingactyyes...104" "HospitalOrChurchNo" 
## [7] "HospitalOrChurchYes" "EZVersionNumber"
## [1] "Gamingactyno"  "Gamingactyyes"

However, in 2019 the duplicated variables hold different values.

## [1] FALSE
  • 11559 discrepancies and 44587 identical values between the variales.
  • 0 NAs in Gamingactyno…91 variable and 0 NAs in Gamingactyno…103 variable
## [1] FALSE
  • 0 discrepancies and 44587 identical values between the variales
  • 0 NAs in Gamingactyyes…92 variable and 11559 NAs in Gamingactyyes…104 variable

For binding with the data of previous years, the name of one of the duplicated columns will be set back to its original name (removing the column #). We will leave the duplicated variables present in 2018 and 2019 data and will rename them with a “.1” at the end.

Now, binding all years into one single database

1.3. Fixing some variables and saving 2016-2019 dataset rds file

Data set has no gaps in the id variables

  • EIN variable has 0 NAs
  • Case.Number variable has 0 NAs

Current dataset has two variables for ORGNAME and they seem to be a single name split into the two vars.

Orgname1 has 0 NAs, while Orgname2 has 200780, NAs.

0 blanks in Orgname1" and 0 blanks in Orgname2

Looking further check to make sure the variables work when combined:

Merging Orgname 1 and 2 to create variable ORGNAME

These are some orgname1 values before merging data in orgname2

After merging

Some minor changes to variables and removing redundant information

The dataset has 263272 nonprofits, and 112 variables.

2. Building a Nonprofit (NPO) Addresses dataset

Using the NONPROFIT-ADDRESSES-2014-2019.rds dataset to make a new file with only the NPO data necessary.

These will be dropped:

##  [1] "Primarycontactname"      "Primarycontactphone"    
##  [3] "Primarycontactphoneext"  "Primarycontactfax"      
##  [5] "Ofcrdirtrust1firstname"  "Ofcrdirtrust1lastname"  
##  [7] "Ofcrdirtrust1title"      "Ofcrdirtrust1streetaddr"
##  [9] "Ofcrdirtrust1city"       "Ofcrdirtrust1state"     
## [11] "Ofcrdirtrust1zip"        "Ofcrdirtrust1zippl4"    
## [13] "Ofcrdirtrust2firstname"  "Ofcrdirtrust2lastname"  
## [15] "Ofcrdirtrust2title"      "Ofcrdirtrust2streetaddr"
## [17] "Ofcrdirtrust2city"       "Ofcrdirtrust2state"     
## [19] "Ofcrdirtrust2zip"        "Ofcrdirtrust2zippl4"    
## [21] "Ofcrdirtrust3firstname"  "Ofcrdirtrust3lastname"  
## [23] "Ofcrdirtrust3title"      "Ofcrdirtrust3streetaddr"
## [25] "Ofcrdirtrust3city"       "Ofcrdirtrust3state"     
## [27] "Ofcrdirtrust3zip"        "Ofcrdirtrust3zippl4"    
## [29] "Ofcrdirtrust4firstname"  "Ofcrdirtrust4lastname"  
## [31] "Ofcrdirtrust4title"      "Ofcrdirtrust4streetaddr"
## [33] "Ofcrdirtrust4city"       "Ofcrdirtrust4state"     
## [35] "Ofcrdirtrust4zip"        "Ofcrdirtrust4zippl4"    
## [37] "Ofcrdirtrust5firstname"  "Ofcrdirtrust5lastname"  
## [39] "Ofcrdirtrust5title"      "Ofcrdirtrust5streetaddr"
## [41] "Ofcrdirtrust5city"       "Ofcrdirtrust5state"     
## [43] "Ofcrdirtrust5zip"        "Ofcrdirtrust5zippl4"    
## [45] "Orgemail"

Adding a Year var, and removing redundant data

We have 263272 cases with ID, of which 263138 are unique.

Exploring duplicates: how many? and how many repetitions?

The dataset had ID-2019-123456789 repeated 6 times with different npo information The EIN looks made up (1-2-3-4…)

ID-2019-833992638 is repeated 3 times, with different npo information.

The rest are duplicates with what appears the same npo information but different Case.Number and Signaturedate.

Take, for example, cases 10 and 11, they differ in those variables:

Or case 12 and 13:

Given ID is not unique, we will create a key variable that is unique

Saving rds file

The dataset has 263272 nonprofits, and 70 variables.

3. Building a Board Members (PPL) Addresses dataset

Using the NONPROFIT-ADDRESSES-2014-2019.rds dataset to create a new dataset with key Board Members information.

Code chunk to upload the rds file - in case required

NONPROFIT-ADDRESSES-2014-2019.rds dataset is structured with a nonprofit for each row, holding informatino for five NPO members in mutiple columns. In the following code chunk we are reshaping that dataset so that each row is a board member with a unique ID.

Current ID variable for NPOs is ID-YEAR-EIN. We need to create a new ID variable which is unique for each board member in the same org. To make this ID variable we will add a count for each member, in the order reported on the 1023 forms.

Like this: ID-YEAR-EIN-##

This ID will be used when geo-coding as reference to re-join the geocoded data back to the original data.

#first member listed in the npos
d1 <- 
  dat %>%
  select( ID, ORGNAME, EIN, Signaturedate, Case.Number,  
          Ofcrdirtrust1firstname, Ofcrdirtrust1lastname, 
          Ofcrdirtrust1title, Ofcrdirtrust1streetaddr, 
          Ofcrdirtrust1city, Ofcrdirtrust1state, 
          Ofcrdirtrust1zip, Ofcrdirtrust1zippl4 ) 

#standardizing the variable names for binding
nmz <- names(d1)
nmz <- gsub( "Ofcrdirtrust[1-9]", "Ofcrdirtrust", nmz )
names( d1 ) <- nmz

#adding board member #
d1$ID <- paste0( d1$ID, "-01" )

#second member listed in the npos
d2 <- 
  dat %>% 
  select( ID, ORGNAME, EIN, Signaturedate, Case.Number,
          Ofcrdirtrust2firstname, Ofcrdirtrust2lastname, 
          Ofcrdirtrust2title, Ofcrdirtrust2streetaddr, 
          Ofcrdirtrust2city, Ofcrdirtrust2state, 
          Ofcrdirtrust2zip, Ofcrdirtrust2zippl4 )
nmz <- names(d2)
nmz <- gsub( "Ofcrdirtrust[1-9]", "Ofcrdirtrust", nmz )
names( d2 ) <- nmz
d2$ID <- paste0( d2$ID, "-02" )

#third member listed in the npos
d3 <- 
  dat %>% 
  select( ID, ORGNAME, EIN, Signaturedate, Case.Number,
          Ofcrdirtrust3firstname, Ofcrdirtrust3lastname, 
          Ofcrdirtrust3title, Ofcrdirtrust3streetaddr, 
          Ofcrdirtrust3city, Ofcrdirtrust3state,
          Ofcrdirtrust3zip, Ofcrdirtrust3zippl4 )
nmz <- names(d3)
nmz <- gsub( "Ofcrdirtrust[1-9]", "Ofcrdirtrust", nmz )
names( d3 ) <- nmz
d3$ID <- paste0( d3$ID, "-03" )

#fourth member listed in the npos
d4 <- 
  dat %>% 
  select( ID, ORGNAME, EIN, Signaturedate, Case.Number, 
          Ofcrdirtrust4firstname, Ofcrdirtrust4lastname, 
          Ofcrdirtrust4title, Ofcrdirtrust4streetaddr, 
          Ofcrdirtrust4city, Ofcrdirtrust4state, 
          Ofcrdirtrust4zip, Ofcrdirtrust4zippl4 )
nmz <- names(d4)
nmz <- gsub( "Ofcrdirtrust[1-9]", "Ofcrdirtrust", nmz )
names( d4 ) <- nmz
d4$ID <- paste0( d4$ID, "-04" )

#fifth member listed in the npos
d5 <- 
  dat %>% 
  select( ID, ORGNAME, EIN, Signaturedate, Case.Number, 
          Ofcrdirtrust5firstname, Ofcrdirtrust5lastname, 
          Ofcrdirtrust5title, Ofcrdirtrust5streetaddr, 
          Ofcrdirtrust5city, Ofcrdirtrust5state, 
          Ofcrdirtrust5zip, Ofcrdirtrust5zippl4 )
nmz <- names(d5)
nmz <- gsub( "Ofcrdirtrust[1-9]", "Ofcrdirtrust", nmz )
names( d5 ) <- nmz
d5$ID <- paste0( d5$ID, "-05" )

#binding all data
ppl <- bind_rows( d1, d2, d3, d4, d5 ) #all people data as individual cases.

Dropping empty cases and redundant data, adding a YR var, changing some var names and arranging the order

2014 2015 2016 2017 2018 2019
56945 155945 172335 175196 192512 193160
##  [1] "ID"                     "ORGNAME"                "EIN"                   
##  [4] "Signaturedate"          "Case.Number"            "Ofcrdirtrustfirstname" 
##  [7] "Ofcrdirtrustlastname"   "Ofcrdirtrusttitle"      "Ofcrdirtruststreetaddr"
## [10] "Ofcrdirtrustcity"       "Ofcrdirtruststate"      "Ofcrdirtrustzip"       
## [13] "Ofcrdirtrustzippl4"     "YR"
##  [1] "ID"            "ORGNAME"       "EIN"           "Signaturedate"
##  [5] "Case.Number"   "Firstname"     "Lastname"      "Title"        
##  [9] "Address"       "City"          "State"         "Zip"          
## [13] "Zippl4"        "YR"

Now we will add a Gender variable. We use the Gender package, which estimates the gender of names using historicl data.

either female male NA
0 0.49 0.42 0.09

Ordering variables

Checking to see that ID variable is unique…

##  [1] "ID"              "ORGNAME"         "EIN"             "YR"             
##  [5] "Signaturedate"   "Case.Number"     "Firstname"       "Lastname"       
##  [9] "Title"           "Address"         "City"            "State"          
## [13] "Zip"             "Zippl4"          "gender"          "proportion_male"
## [17] "IDdup"

Given ID is not unique, we will create a key variable that is unique

##  [1] "ID"              "ORGNAME"         "EIN"             "YR"             
##  [5] "Signaturedate"   "Case.Number"     "Firstname"       "Lastname"       
##  [9] "Title"           "Address"         "City"            "State"          
## [13] "Zip"             "Zippl4"          "gender"          "proportion_male"
## [17] "IDdup"           "key"

Saving rds file

The dataset has 946093 nonprofits, and 18 variables.