• Packages
  • 1. Downloading raw data and creating a comprenhensive dataset file
    • 1.1. Transforming xlsx files into rds.
    • 1.2. Binding all rds files into a a single comprenhensive rds
    • 1.3. Fixing some variables and saving 2016-2019 dataset rds file
  • 2. Building a Nonprofit (NPO) Addresses dataset
  • 3. Building a Board Members (PPL) Addresses dataset

In this first script we will compile all raw data and create three separate files:

  1. NONPROFIT-ADDRESSES-2014-2019.rds: Data file with complete nonprofit data
  2. NONPROFIT-2014-2019.rds: Nonprofit dataset with addressses and other essential information
  3. PEOPLE-2014-2019.rds: Board member dataset with addresses and other essential information

Files 2 and 3 will then be enhanced with geolocation data.

1. Downloading raw data and creating a comprenhensive dataset file

Data files have been previously manually downloaded from the following address under the heading “Files by Year”:

https://www.irs.gov/charities-non-profits/exempt-organizations-form-1023ez-approvals

Standard download format is .xlsx and the data is provided by year.

1.2. Binding all rds files into a a single comprenhensive rds

Before we bind, we need to adjust some variables. All variables remain consistent until 2018, where some additional variables are added.

## [1] "Gamingactyno...91"   "Gamingactyyes...92"  "Mission"            
## [4] "Gamingactyno...103"  "Gamingactyyes...104" "HospitalOrChurchNo" 
## [7] "HospitalOrChurchYes" "EZVersionNumber"
## [1] "Gamingactyno"  "Gamingactyyes"

2018 datafile has duplicated “Gamingactyno” and “Gamingactyyes” columns, note that the excel inport process has added the column number to the name to avoid duplicate var names. In addition, we have four new variables:

  • “Mission”
  • “HospitalOrChurchNo”
  • “HospitalOrChurchYes”
  • “EZVersionNumber”

“Gamingactyno” and “Gamingactyyes” in 2018 datafile both have identical data.

## [1] TRUE
## [1] TRUE

2019 has the same additional variables as 2018

## [1] "Gamingactyno...91"   "Gamingactyyes...92"  "Mission"            
## [4] "Gamingactyno...103"  "Gamingactyyes...104" "HospitalOrChurchNo" 
## [7] "HospitalOrChurchYes" "EZVersionNumber"
## [1] "Gamingactyno"  "Gamingactyyes"

However, in 2019 the duplicated variables hold different values.

## [1] FALSE
  • 11559 discrepancies and 44587 identical values between the variales.
  • 0 NAs in Gamingactyno…91 variable and 0 NAs in Gamingactyno…103 variable
## [1] FALSE
  • 0 discrepancies and 44587 identical values between the variales
  • 0 NAs in Gamingactyyes…92 variable and 11559 NAs in Gamingactyyes…104 variable

For binding with the data of previous years, the name of one of the duplicated columns will be set back to its original name (removing the column #). We will leave the duplicated variables present in 2018 and 2019 data and will rename them with a “.1” at the end.

Now, binding all years into one single database

1.3. Fixing some variables and saving 2016-2019 dataset rds file

Data set has no gaps in the id variables

  • EIN variable has 0 NAs
  • Case.Number variable has 0 NAs

Current dataset has two variables for ORGNAME and they seem to be a single name split into the two vars.

ABCDEFGHIJ0123456789
Orgname1
<chr>
Orgname2
<chr>
GOODWINS MILLS FIREFIGHTERS RELIEFASSOCIATION
MAINE WELFARE DIRECTORS ASSOCIATIONNA
MAINE MYCOLOGICAL ASSOCIATION INCNA
BONNY EAGLE BOYS BASKETBALLNA
BUCKSPORT AREA CULTURAL ARTS SOCIETY- BACAS

Orgname1 has 0 NAs, while Orgname2 has 200780, NAs.

0 blanks in Orgname1" and 0 blanks in Orgname2

Looking further check to make sure the variables work when combined:

ABCDEFGHIJ0123456789
Orgname1
<chr>
Orgname2
<chr>
Org2len
<int>
THE HERSHEL WOODY WILLIAMS CONGRESSIONAL MEDAL OF HONOR EDUC FOUND INC35
MOST WORSHIPFUL DORIC GRAND LODGE OF ANCIENT FREE AND ACCEPTED MASON I35
BLUE WATER JAZZ SOCIETY REV PROC 2014 11 REINSTATE POSTMARK DATE35
WEST CHESTER OF PENNSYLVANIA CHAPTER INSTITUTE OF MANAGEMENT ACCOUNTAN35
FUTURE FARMERS OF AMERICA STATE 29002 ATLANTIC COUNTY FFA ALUMNI ASSOC35
THE FRIENDS OF THE TUSCARAWAS CO PUBLIC LIBRARY--NEW PHILADELPHIA LOCA35
THE WOMENS CEO GLOBAL ALLIANCE-A BEHAVIORAL HEALTH LEADERSHIP ROUNDTAB35
AMERICAN ASSOCIATION OF BLACKS IN ENERGY CHICAGO METROPOLITAN CHAPTER35
THE LAUDERDALE COUNTY SPECIAL PROGRAMMING FOR ACHIEVEMENT NETWORK SPAN35
HERNANDO COUNTY FIRE CORPS STREAMLINED RETROACTIVE REINSTATMENT35

Merging Orgname 1 and 2 to create variable ORGNAME

These are some orgname1 values before merging data in orgname2

ABCDEFGHIJ0123456789
Orgname1
<chr>
Orgname2
<chr>
GOODWINS MILLS FIREFIGHTERS RELIEFASSOCIATION
BUCKSPORT AREA CULTURAL ARTS SOCIETY- BACAS
NEW RICHLAND AREA HISTORICAL SOCIETY
SPECIAL IMPROVEMENT DISTRICT OF NETCONG BOROUGH INC
KENNESAW MOUNTAIN HIGH SCHOOL TRACKFIELD BOOSTER CLUB INC
TRI-ARC COMMUNITY DEVELOPMENT CORPORATION
WALNUT BEND ELEMENTARY SCHOOL PARENT-TEACHER ORGANIZATION
THE LANCASTER COUNTY JUNIOR GOLF TOUR
KINGDOM DEVELOPMENT CORPORATION OFBAHAMA
SOUTHERN MIDDLE SCHOOL BAND BOOSTERS

After merging

ABCDEFGHIJ0123456789
ORGNAME
<chr>
GOODWINS MILLS FIREFIGHTERS RELIEFASSOCIATION
BUCKSPORT AREA CULTURAL ARTS SOCIETY- BACAS
NEW RICHLAND AREA HISTORICAL SOCIETY
SPECIAL IMPROVEMENT DISTRICT OF NETCONG BOROUGH INC
KENNESAW MOUNTAIN HIGH SCHOOL TRACKFIELD BOOSTER CLUB INC
TRI-ARC COMMUNITY DEVELOPMENT CORPORATION
WALNUT BEND ELEMENTARY SCHOOL PARENT-TEACHER ORGANIZATION
THE LANCASTER COUNTY JUNIOR GOLF TOUR
KINGDOM DEVELOPMENT CORPORATION OFBAHAMA
SOUTHERN MIDDLE SCHOOL BAND BOOSTERS

Some minor changes to variables and removing redundant information

The dataset has 263272 nonprofits, and 112 variables.

ABCDEFGHIJ0123456789
EIN
<chr>
Case.Number
<chr>
Formrevision
<chr>
Eligibilityworksheet
<chr>
010278788EO-2014328-0001220620141
010468034EO-2014328-0002780620141
010512631EO-2014343-0000770620141
010513008EO-2014318-0002000620141
010531578EO-2014245-0004050620141
010533351EO-2014294-0002050620141
010572607EO-2014276-0003870620141
010584691EO-2014269-0001380620141
010601805EO-2014251-0002830620141
010604316EO-2014246-0002710620141

2. Building a Nonprofit (NPO) Addresses dataset

Using the NONPROFIT-ADDRESSES-2014-2019.rds dataset to make a new file with only the NPO data necessary.

These will be dropped:

##  [1] "Primarycontactname"      "Primarycontactphone"    
##  [3] "Primarycontactphoneext"  "Primarycontactfax"      
##  [5] "Ofcrdirtrust1firstname"  "Ofcrdirtrust1lastname"  
##  [7] "Ofcrdirtrust1title"      "Ofcrdirtrust1streetaddr"
##  [9] "Ofcrdirtrust1city"       "Ofcrdirtrust1state"     
## [11] "Ofcrdirtrust1zip"        "Ofcrdirtrust1zippl4"    
## [13] "Ofcrdirtrust2firstname"  "Ofcrdirtrust2lastname"  
## [15] "Ofcrdirtrust2title"      "Ofcrdirtrust2streetaddr"
## [17] "Ofcrdirtrust2city"       "Ofcrdirtrust2state"     
## [19] "Ofcrdirtrust2zip"        "Ofcrdirtrust2zippl4"    
## [21] "Ofcrdirtrust3firstname"  "Ofcrdirtrust3lastname"  
## [23] "Ofcrdirtrust3title"      "Ofcrdirtrust3streetaddr"
## [25] "Ofcrdirtrust3city"       "Ofcrdirtrust3state"     
## [27] "Ofcrdirtrust3zip"        "Ofcrdirtrust3zippl4"    
## [29] "Ofcrdirtrust4firstname"  "Ofcrdirtrust4lastname"  
## [31] "Ofcrdirtrust4title"      "Ofcrdirtrust4streetaddr"
## [33] "Ofcrdirtrust4city"       "Ofcrdirtrust4state"     
## [35] "Ofcrdirtrust4zip"        "Ofcrdirtrust4zippl4"    
## [37] "Ofcrdirtrust5firstname"  "Ofcrdirtrust5lastname"  
## [39] "Ofcrdirtrust5title"      "Ofcrdirtrust5streetaddr"
## [41] "Ofcrdirtrust5city"       "Ofcrdirtrust5state"     
## [43] "Ofcrdirtrust5zip"        "Ofcrdirtrust5zippl4"    
## [45] "Orgemail"

Adding a Year var, and removing redundant data

We have 263272 cases with ID, of which 263138 are unique.

Exploring duplicates: how many? and how many repetitions?

ABCDEFGHIJ0123456789
 
 
ID
<chr>
IDdup
<int>
1ID-2019-1234567896
2ID-2019-8339926383
3ID-2014-3007425962
4ID-2014-4648085712
5ID-2014-4650088402
6ID-2014-4715455372

The dataset had ID-2019-123456789 repeated 6 times with different npo information The EIN looks made up (1-2-3-4…)

ABCDEFGHIJ0123456789
ID
<chr>
ORGNAME
<chr>
ID-2019-123456789MEMORIAL 3GUN FOUNDATION INC
ID-2019-123456789NORTH AMERICAN - ISRAELI PULMONARYALLIANCE INC
ID-2019-123456789BABYS FIRST NEEDS INC
ID-2019-123456789CITY GATES INITIATIVE
ID-2019-123456789BARREN COUNTY FOSTER AND ADOPTIVE PARENT ASSOCIATION INC
ID-2019-123456789ISPARKLE2 INC

ID-2019-833992638 is repeated 3 times, with different npo information.

ABCDEFGHIJ0123456789
ID
<chr>
ORGNAME
<chr>
ID-2019-833992638A BEAUTIFUL ME A BEAUTIFUL YOU INC
ID-2019-833992638BTF FOUNDATION
ID-2019-833992638THREAT FILLING STATION

The rest are duplicates with what appears the same npo information but different Case.Number and Signaturedate.

Take, for example, cases 10 and 11, they differ in those variables:

ABCDEFGHIJ0123456789
Var
<chr>
Case1
<chr>
Case2
<chr>
IsitEqual
<lgl>
Case.NumberEO-2014267-000383EO-2014283-000155FALSE
Signaturedate14113440001412726400FALSE

Or case 12 and 13:

ABCDEFGHIJ0123456789
Var
<chr>
Case1
<chr>
Case2
<chr>
IsitEqual
<lgl>
Case.NumberEO-2014254-000047EO-2014268-000464FALSE
Signaturedate14100480001411430400FALSE

Given ID is not unique, we will create a key variable that is unique

Saving rds file

The dataset has 263272 nonprofits, and 70 variables.

ABCDEFGHIJ0123456789
ID
<chr>
key
<int>
ORGNAME
<chr>
ID-2014-0102787881GOODWINS MILLS FIREFIGHTERS RELIEFASSOCIATION
ID-2014-0104680342MAINE WELFARE DIRECTORS ASSOCIATION
ID-2014-0105126313MAINE MYCOLOGICAL ASSOCIATION INC
ID-2014-0105130084BONNY EAGLE BOYS BASKETBALL
ID-2014-0105315785BUCKSPORT AREA CULTURAL ARTS SOCIETY- BACAS
ID-2014-0105333516PORTLAND RUGBY FOOTBALL CLUB
ID-2014-0105726077NEW RICHLAND AREA HISTORICAL SOCIETY
ID-2014-0105846918SPECIAL IMPROVEMENT DISTRICT OF NETCONG BOROUGH INC
ID-2014-0106018059KIRTLAND KIWANIS FOUNDATION
ID-2014-01060431610KENNESAW MOUNTAIN HIGH SCHOOL TRACKFIELD BOOSTER CLUB INC

3. Building a Board Members (PPL) Addresses dataset

Using the NONPROFIT-ADDRESSES-2014-2019.rds dataset to create a new dataset with key Board Members information.

Code chunk to upload the rds file - in case required

NONPROFIT-ADDRESSES-2014-2019.rds dataset is structured with a nonprofit for each row, holding informatino for five NPO members in mutiple columns. In the following code chunk we are reshaping that dataset so that each row is a board member with a unique ID.

Current ID variable for NPOs is ID-YEAR-EIN. We need to create a new ID variable which is unique for each board member in the same org. To make this ID variable we will add a count for each member, in the order reported on the 1023 forms.

Like this: ID-YEAR-EIN-##

This ID will be used when geo-coding as reference to re-join the geocoded data back to the original data.

#first member listed in the npos
d1 <- 
  dat %>%
  select( ID, ORGNAME, EIN, Signaturedate, Case.Number,  
          Ofcrdirtrust1firstname, Ofcrdirtrust1lastname, 
          Ofcrdirtrust1title, Ofcrdirtrust1streetaddr, 
          Ofcrdirtrust1city, Ofcrdirtrust1state, 
          Ofcrdirtrust1zip, Ofcrdirtrust1zippl4 ) 

#standardizing the variable names for binding
nmz <- names(d1)
nmz <- gsub( "Ofcrdirtrust[1-9]", "Ofcrdirtrust", nmz )
names( d1 ) <- nmz

#adding board member #
d1$ID <- paste0( d1$ID, "-01" )

#second member listed in the npos
d2 <- 
  dat %>% 
  select( ID, ORGNAME, EIN, Signaturedate, Case.Number,
          Ofcrdirtrust2firstname, Ofcrdirtrust2lastname, 
          Ofcrdirtrust2title, Ofcrdirtrust2streetaddr, 
          Ofcrdirtrust2city, Ofcrdirtrust2state, 
          Ofcrdirtrust2zip, Ofcrdirtrust2zippl4 )
nmz <- names(d2)
nmz <- gsub( "Ofcrdirtrust[1-9]", "Ofcrdirtrust", nmz )
names( d2 ) <- nmz
d2$ID <- paste0( d2$ID, "-02" )

#third member listed in the npos
d3 <- 
  dat %>% 
  select( ID, ORGNAME, EIN, Signaturedate, Case.Number,
          Ofcrdirtrust3firstname, Ofcrdirtrust3lastname, 
          Ofcrdirtrust3title, Ofcrdirtrust3streetaddr, 
          Ofcrdirtrust3city, Ofcrdirtrust3state,
          Ofcrdirtrust3zip, Ofcrdirtrust3zippl4 )
nmz <- names(d3)
nmz <- gsub( "Ofcrdirtrust[1-9]", "Ofcrdirtrust", nmz )
names( d3 ) <- nmz
d3$ID <- paste0( d3$ID, "-03" )

#fourth member listed in the npos
d4 <- 
  dat %>% 
  select( ID, ORGNAME, EIN, Signaturedate, Case.Number, 
          Ofcrdirtrust4firstname, Ofcrdirtrust4lastname, 
          Ofcrdirtrust4title, Ofcrdirtrust4streetaddr, 
          Ofcrdirtrust4city, Ofcrdirtrust4state, 
          Ofcrdirtrust4zip, Ofcrdirtrust4zippl4 )
nmz <- names(d4)
nmz <- gsub( "Ofcrdirtrust[1-9]", "Ofcrdirtrust", nmz )
names( d4 ) <- nmz
d4$ID <- paste0( d4$ID, "-04" )

#fifth member listed in the npos
d5 <- 
  dat %>% 
  select( ID, ORGNAME, EIN, Signaturedate, Case.Number, 
          Ofcrdirtrust5firstname, Ofcrdirtrust5lastname, 
          Ofcrdirtrust5title, Ofcrdirtrust5streetaddr, 
          Ofcrdirtrust5city, Ofcrdirtrust5state, 
          Ofcrdirtrust5zip, Ofcrdirtrust5zippl4 )
nmz <- names(d5)
nmz <- gsub( "Ofcrdirtrust[1-9]", "Ofcrdirtrust", nmz )
names( d5 ) <- nmz
d5$ID <- paste0( d5$ID, "-05" )

#binding all data
ppl <- bind_rows( d1, d2, d3, d4, d5 ) #all people data as individual cases.

Dropping empty cases and redundant data, adding a YR var, changing some var names and arranging the order

2014 2015 2016 2017 2018 2019
56945 155945 172335 175196 192512 193160
##  [1] "ID"                     "ORGNAME"                "EIN"                   
##  [4] "Signaturedate"          "Case.Number"            "Ofcrdirtrustfirstname" 
##  [7] "Ofcrdirtrustlastname"   "Ofcrdirtrusttitle"      "Ofcrdirtruststreetaddr"
## [10] "Ofcrdirtrustcity"       "Ofcrdirtruststate"      "Ofcrdirtrustzip"       
## [13] "Ofcrdirtrustzippl4"     "YR"
##  [1] "ID"            "ORGNAME"       "EIN"           "Signaturedate"
##  [5] "Case.Number"   "Firstname"     "Lastname"      "Title"        
##  [9] "Address"       "City"          "State"         "Zip"          
## [13] "Zippl4"        "YR"

Now we will add a Gender variable. We use the Gender package, which estimates the gender of names using historicl data.

ABCDEFGHIJ0123456789
name
<chr>
gender
<chr>
proportion_male
<dbl>
AADImale0.9809
AADITYAmale1.0000
AAHANmale1.0000
AAKANKSHAfemale0.0000
AAKASHmale1.0000
AALEYAHfemale0.0000
either female male NA
0 0.49 0.42 0.09

Ordering variables

Checking to see that ID variable is unique…

ABCDEFGHIJ0123456789
 
 
Var1
<fctr>
Freq
<int>
753793ID-2019-123456789-016
753794ID-2019-123456789-025
753795ID-2019-123456789-035
753796ID-2019-123456789-044
868218ID-2019-833992638-013
868219ID-2019-833992638-023
##  [1] "ID"              "ORGNAME"         "EIN"             "YR"             
##  [5] "Signaturedate"   "Case.Number"     "Firstname"       "Lastname"       
##  [9] "Title"           "Address"         "City"            "State"          
## [13] "Zip"             "Zippl4"          "gender"          "proportion_male"
## [17] "IDdup"

Given ID is not unique, we will create a key variable that is unique

##  [1] "ID"              "ORGNAME"         "EIN"             "YR"             
##  [5] "Signaturedate"   "Case.Number"     "Firstname"       "Lastname"       
##  [9] "Title"           "Address"         "City"            "State"          
## [13] "Zip"             "Zippl4"          "gender"          "proportion_male"
## [17] "IDdup"           "key"

Saving rds file

The dataset has 946093 nonprofits, and 18 variables.

ABCDEFGHIJ0123456789
ID
<chr>
key
<int>
ORGNAME
<chr>
EIN
<chr>
YR
<chr>
ID-2014-010278788-011GOODWINS MILLS FIREFIGHTERS RELIEFASSOCIATION0102787882014
ID-2014-010278788-022GOODWINS MILLS FIREFIGHTERS RELIEFASSOCIATION0102787882014
ID-2014-010278788-033GOODWINS MILLS FIREFIGHTERS RELIEFASSOCIATION0102787882014
ID-2014-010278788-044GOODWINS MILLS FIREFIGHTERS RELIEFASSOCIATION0102787882014
ID-2014-010278788-055GOODWINS MILLS FIREFIGHTERS RELIEFASSOCIATION0102787882014
ID-2014-010468034-016MAINE WELFARE DIRECTORS ASSOCIATION0104680342014
ID-2014-010468034-027MAINE WELFARE DIRECTORS ASSOCIATION0104680342014
ID-2014-010468034-038MAINE WELFARE DIRECTORS ASSOCIATION0104680342014
ID-2014-010468034-049MAINE WELFARE DIRECTORS ASSOCIATION0104680342014
ID-2014-010468034-0510MAINE WELFARE DIRECTORS ASSOCIATION0104680342014