Introduction

This report covers:

  1. Description of dataset
  2. Process of developing the dataset
  3. Summary of data gaps, consistency tests and disambiguation
  4. Ideas on further questions and potential enhancements

Loading packages and files

#setting up the environment
library( dplyr )
library( tidyr )
library( pander )

#update the path with your working directory:
wd <- "/Users/icps86/Dropbox/R Projects/Open_data_ignacio"
setwd(wd)

1. Description of dataset

1.1 General description

Sources, key variables, dimensions, etc. Nonprofit Dataset

For more information on the variables see the Data Dictionary [PENDING]

Dataset development/augmentation process

Dataset Development Process

1023-EZ data has been used to generate two datasets:

  • Nonprofit dataset: each observation is a nonprofit organization (approximately )
  • People dataset: each observation is an organization member (approximately )

The datasets have been manipulated to remove any sensitive information and to include the following enhancements:

  • Nonprofit mission and taxonomy classification
  • Geographic data (geocodes) for nonprofit addresses and organization member addresses (latitudes and longitudes).
  • Census tract corresponding to each geocode, and key census demographic data for the corresponding nonprofit and organization member addresses.
  • Voting district corresponding to each geocode, and historic voting results for the district.
  • Gender of organization members (inferred from historic U.S. name data).

Main Characteristics

a) IDs are repeated, so we used input_address as the key for the Census and Google geocoding.

Using input_address as the key also makes geocoding more efficient, since a repeated address only needs to be geocoded once.

  • Some cases with duplicate IDs have the same address written in slightly different ways. It might be useful to understand why the process allows two addresses to be filled in: could it be the same business registering twice?

1.2 Data Sources

The Nonprofit datasets (Nonprofit and Board Members) were generated from the IRS's 1023-EZ forms for years 2014 to 2019, which are publicly available here.

Nonprofits submit Form 1023-EZ to apply for recognition as a tax-exempt organization under Section 501(c)(3) of the Internal Revenue Code; for more details, visit the IRS website. Only organizations that meet certain criteria are eligible. Some of the main requirements are:

  • Annual gross receipts for the past 3 years and expected gross receipts for the next 3 years do not exceed $50,000
  • Assets less than $250,000
  • Mailing address inside US territory

For more detail, see p. 13 of the 1023-EZ application instructions.

According to the 1023-EZ instructions, nonprofits must submit detailed organization information, including the organization's name, mailing address, Employer Identification Number (EIN), mission, and contact person, as well as the name, title, and mailing address of up to five organization officers, directors, and/or trustees. The 1023-EZ instructions prioritize the reporting of organization members in the following order:

  1. President or chief executive officer or chief operating officer.
  2. Treasurer or chief financial officer.
  3. Chairperson of the governing body.
  4. Any officers, directors, and trustees who are substantial contributors (not already listed above).
  5. Any other officers, directors, and trustees who are related to a substantial contributor (not already listed above).
  6. Voting members of the governing body (not already listed above).
  7. Officers (not already listed above).

If an individual serves in more than one office (for example, as both an officer and director), list this individual on only one line and list all offices held.

An officer is a person elected or appointed to manage the organization’s daily operations, such as president, vice president, secretary, treasurer, and, in some cases, board chair. The officers of an organization are determined by reference to its organizing document, bylaws, or resolutions of its governing body, or otherwise designated consistent with state law. A director or trustee is a member of the organization’s governing body, but only if the member has voting rights.

1.3 Dataset Representativeness and Relevance

Do the e-filers represent all or most of 1023-EZ filers? What proportion of 1023 filers are EZ filers? What proportion of small nonprofits are recent 1023-EZ filers? Are EZ filers representative of all 1023 filers? If not, what is particular about this sample?

[Pending]

1.4 Descriptive Statistics

Loading the data

npo1 <- readRDS("Data/3_GeoCensus/NONPROFITS-2014-2019v3.rds")
npo2 <- readRDS("Data/4_GeoGoogle/NONPROFITS-2014-2019v4.rds")
npo3 <- readRDS("Data/5_ZipandCity/NONPROFITS-2014-2019v5.rds")

names(npo1)
##  [1] "ID"                          "key"                        
##  [3] "ORGNAME"                     "EIN"                        
##  [5] "YR"                          "Mission"                    
##  [7] "Orgname1"                    "Orgname2"                   
##  [9] "Case.Number"                 "Formrevision"               
## [11] "Eligibilityworksheet"        "Address"                    
## [13] "City"                        "State"                      
## [15] "Zip"                         "Zippl4"                     
## [17] "Accountingperiodend"         "Userfeesubmitted"           
## [19] "Orgurl"                      "Orgtypecorp"                
## [21] "Orgtypeunincorp"             "Orgtypetrust"               
## [23] "Necessaryorgdocs"            "Incorporateddate"           
## [25] "Incorporatedstate"           "Containslimitation"         
## [27] "Doesnotexpresslyempower"     "Containsdissolution"        
## [29] "Nteecode"                    "Orgpurposecharitable"       
## [31] "Orgpurposereligious"         "Orgpurposeeducational"      
## [33] "Orgpurposescientific"        "Orgpurposeliterary"         
## [35] "Orgpurposepublicsafety"      "Orgpurposeamateursports"    
## [37] "Orgpurposecrueltyprevention" "Qualifyforexemption"        
## [39] "Leginflno"                   "Leginflyes"                 
## [41] "Compofcrdirtrustno"          "Compofcrdirtrustyes"        
## [43] "Donatefundsno"               "Donatefundsyes"             
## [45] "Conductactyoutsideusno"      "Conductactyoutsideusyes"    
## [47] "Financialtransofcrsno"       "Financialtransofcrsyes"     
## [49] "Unrelgrossincm1000moreno"    "Unrelgrossincm1000moreyes"  
## [51] "Gamingactyno"                "Gamingactyyes"              
## [53] "Disasterreliefno"            "Disasterreliefyes"          
## [55] "Onethirdsupportpublic"       "Onethirdsupportgifts"       
## [57] "Benefitofcollege"            "Privatefoundation508e"      
## [59] "Seekingretroreinstatement"   "Seekingsec7reinstatement"   
## [61] "Gamingactyno.1"              "Gamingactyyes.1"            
## [63] "HospitalOrChurchNo"          "HospitalOrChurchYes"        
## [65] "Correctnessdeclaration"      "Signaturename"              
## [67] "Signaturetitle"              "Signaturedate"              
## [69] "EZVersionNumber"             "IDdup"                      
## [71] "pob"                         "add.len"                    
## [73] "input_address"               "match"                      
## [75] "match_type"                  "out_address"                
## [77] "lat_lon_cen"                 "tiger_line_id"              
## [79] "tiger_line_side"             "state_fips"                 
## [81] "county_fips"                 "tract_fips"                 
## [83] "block_fips"                  "lon_cen"                    
## [85] "lat_cen"                     "geocode_type"
names(npo2)
##  [1] "ID"                          "key"                        
##  [3] "ORGNAME"                     "EIN"                        
##  [5] "YR"                          "Mission"                    
##  [7] "Orgname1"                    "Orgname2"                   
##  [9] "Case.Number"                 "Formrevision"               
## [11] "Eligibilityworksheet"        "Address"                    
## [13] "City"                        "State"                      
## [15] "Zip"                         "Zippl4"                     
## [17] "Accountingperiodend"         "Userfeesubmitted"           
## [19] "Orgurl"                      "Orgtypecorp"                
## [21] "Orgtypeunincorp"             "Orgtypetrust"               
## [23] "Necessaryorgdocs"            "Incorporateddate"           
## [25] "Incorporatedstate"           "Containslimitation"         
## [27] "Doesnotexpresslyempower"     "Containsdissolution"        
## [29] "Nteecode"                    "Orgpurposecharitable"       
## [31] "Orgpurposereligious"         "Orgpurposeeducational"      
## [33] "Orgpurposescientific"        "Orgpurposeliterary"         
## [35] "Orgpurposepublicsafety"      "Orgpurposeamateursports"    
## [37] "Orgpurposecrueltyprevention" "Qualifyforexemption"        
## [39] "Leginflno"                   "Leginflyes"                 
## [41] "Compofcrdirtrustno"          "Compofcrdirtrustyes"        
## [43] "Donatefundsno"               "Donatefundsyes"             
## [45] "Conductactyoutsideusno"      "Conductactyoutsideusyes"    
## [47] "Financialtransofcrsno"       "Financialtransofcrsyes"     
## [49] "Unrelgrossincm1000moreno"    "Unrelgrossincm1000moreyes"  
## [51] "Gamingactyno"                "Gamingactyyes"              
## [53] "Disasterreliefno"            "Disasterreliefyes"          
## [55] "Onethirdsupportpublic"       "Onethirdsupportgifts"       
## [57] "Benefitofcollege"            "Privatefoundation508e"      
## [59] "Seekingretroreinstatement"   "Seekingsec7reinstatement"   
## [61] "Gamingactyno.1"              "Gamingactyyes.1"            
## [63] "HospitalOrChurchNo"          "HospitalOrChurchYes"        
## [65] "Correctnessdeclaration"      "Signaturename"              
## [67] "Signaturetitle"              "Signaturedate"              
## [69] "EZVersionNumber"             "IDdup"                      
## [71] "pob"                         "add.len"                    
## [73] "input_address"               "match"                      
## [75] "match_type"                  "out_address"                
## [77] "lat_lon_cen"                 "tiger_line_id"              
## [79] "tiger_line_side"             "state_fips"                 
## [81] "county_fips"                 "tract_fips"                 
## [83] "block_fips"                  "lon_cen"                    
## [85] "lat_cen"                     "geocode_type"               
## [87] "lon_ggl"                     "lat_ggl"                    
## [89] "address_ggl"
names(npo3)
##   [1] "ID"                          "key"                        
##   [3] "ORGNAME"                     "EIN"                        
##   [5] "YR"                          "Mission"                    
##   [7] "Orgname1"                    "Orgname2"                   
##   [9] "Case.Number"                 "Formrevision"               
##  [11] "Eligibilityworksheet"        "Address"                    
##  [13] "City"                        "State"                      
##  [15] "Zip"                         "Zippl4"                     
##  [17] "Accountingperiodend"         "Userfeesubmitted"           
##  [19] "Orgurl"                      "Orgtypecorp"                
##  [21] "Orgtypeunincorp"             "Orgtypetrust"               
##  [23] "Necessaryorgdocs"            "Incorporateddate"           
##  [25] "Incorporatedstate"           "Containslimitation"         
##  [27] "Doesnotexpresslyempower"     "Containsdissolution"        
##  [29] "Nteecode"                    "Orgpurposecharitable"       
##  [31] "Orgpurposereligious"         "Orgpurposeeducational"      
##  [33] "Orgpurposescientific"        "Orgpurposeliterary"         
##  [35] "Orgpurposepublicsafety"      "Orgpurposeamateursports"    
##  [37] "Orgpurposecrueltyprevention" "Qualifyforexemption"        
##  [39] "Leginflno"                   "Leginflyes"                 
##  [41] "Compofcrdirtrustno"          "Compofcrdirtrustyes"        
##  [43] "Donatefundsno"               "Donatefundsyes"             
##  [45] "Conductactyoutsideusno"      "Conductactyoutsideusyes"    
##  [47] "Financialtransofcrsno"       "Financialtransofcrsyes"     
##  [49] "Unrelgrossincm1000moreno"    "Unrelgrossincm1000moreyes"  
##  [51] "Gamingactyno"                "Gamingactyyes"              
##  [53] "Disasterreliefno"            "Disasterreliefyes"          
##  [55] "Onethirdsupportpublic"       "Onethirdsupportgifts"       
##  [57] "Benefitofcollege"            "Privatefoundation508e"      
##  [59] "Seekingretroreinstatement"   "Seekingsec7reinstatement"   
##  [61] "Gamingactyno.1"              "Gamingactyyes.1"            
##  [63] "HospitalOrChurchNo"          "HospitalOrChurchYes"        
##  [65] "Correctnessdeclaration"      "Signaturename"              
##  [67] "Signaturetitle"              "Signaturedate"              
##  [69] "EZVersionNumber"             "IDdup"                      
##  [71] "pob"                         "add.len"                    
##  [73] "input_address"               "match"                      
##  [75] "match_type"                  "out_address"                
##  [77] "lat_lon_cen"                 "tiger_line_id"              
##  [79] "tiger_line_side"             "state_fips"                 
##  [81] "county_fips"                 "tract_fips"                 
##  [83] "block_fips"                  "lon_cen"                    
##  [85] "lat_cen"                     "geocode_type"               
##  [87] "lon_ggl"                     "lat_ggl"                    
##  [89] "address_ggl"                 "lat_zip1"                   
##  [91] "lon_zip1"                    "City_zip2"                  
##  [93] "State_zip2"                  "lat_zip2"                   
##  [95] "lon_zip2"                    "city_st"                    
##  [97] "lat_cty"                     "lon_cty"                    
##  [99] "lat"                         "lon"
ppl1 <- readRDS("Data/3_GeoCensus/PEOPLE-2014-2019v3.rds")
ppl2 <- readRDS("Data/4_GeoGoogle/PEOPLE-2014-2019v4.rds")
ppl3 <- readRDS("Data/5_ZipandCity/PEOPLE-2014-2019v5.rds")

names(ppl1)
##  [1] "ID"              "key"             "ORGNAME"         "EIN"            
##  [5] "YR"              "Signaturedate"   "Case.Number"     "Firstname"      
##  [9] "Lastname"        "Title"           "Address"         "City"           
## [13] "State"           "Zip"             "Zippl4"          "gender"         
## [17] "proportion_male" "IDdup"           "pob"             "add.len"        
## [21] "input_address"   "match"           "match_type"      "out_address"    
## [25] "lat_lon_cen"     "tiger_line_id"   "tiger_line_side" "state_fips"     
## [29] "county_fips"     "tract_fips"      "block_fips"      "lon_cen"        
## [33] "lat_cen"         "geocode_type"
names(ppl2)
##  [1] "ID"              "key"             "ORGNAME"         "EIN"            
##  [5] "YR"              "Signaturedate"   "Case.Number"     "Firstname"      
##  [9] "Lastname"        "Title"           "Address"         "City"           
## [13] "State"           "Zip"             "Zippl4"          "gender"         
## [17] "proportion_male" "IDdup"           "pob"             "add.len"        
## [21] "input_address"   "match"           "match_type"      "out_address"    
## [25] "lat_lon_cen"     "tiger_line_id"   "tiger_line_side" "state_fips"     
## [29] "county_fips"     "tract_fips"      "block_fips"      "lon_cen"        
## [33] "lat_cen"         "geocode_type"    "lon_ggl"         "lat_ggl"        
## [37] "address_ggl"
names(ppl3)
##  [1] "ID"              "key"             "ORGNAME"         "EIN"            
##  [5] "YR"              "Signaturedate"   "Case.Number"     "Firstname"      
##  [9] "Lastname"        "Title"           "Address"         "City"           
## [13] "State"           "Zip"             "Zippl4"          "gender"         
## [17] "proportion_male" "IDdup"           "pob"             "add.len"        
## [21] "input_address"   "match"           "match_type"      "out_address"    
## [25] "lat_lon_cen"     "tiger_line_id"   "tiger_line_side" "state_fips"     
## [29] "county_fips"     "tract_fips"      "block_fips"      "lon_cen"        
## [33] "lat_cen"         "geocode_type"    "lon_ggl"         "lat_ggl"        
## [37] "address_ggl"     "lat_zip1"        "lon_zip1"        "City_zip2"      
## [41] "State_zip2"      "lat_zip2"        "lon_zip2"        "city_st"        
## [45] "lat_cty"         "lon_cty"         "lat"             "lon"
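
The three versions of each file differ only in the columns added at each geocoding step. A minimal sketch of how to list just those additions, using short stand-in name vectors (v3, v4, v5 are hypothetical; with the real data one would pass names(npo1), names(npo2) and names(npo3)):

```r
# Stand-in column names for three successive versions of a dataset.
# With the real files, replace these with names(npo1), names(npo2), names(npo3).
v3 <- c("ID", "input_address", "lat_cen", "lon_cen", "geocode_type")
v4 <- c(v3, "lon_ggl", "lat_ggl", "address_ggl")
v5 <- c(v4, "lat_zip1", "lon_zip1", "lat", "lon")

# Columns added at each step (setdiff keeps the order of its first argument)
added_v4 <- setdiff(v4, v3)
added_v5 <- setdiff(v5, v4)

# Columns dropped between versions (expected to be none)
dropped_v4 <- setdiff(v3, v4)

print(added_v4)
print(added_v5)
```

Printing only the differences avoids repeating the full column listing for every version.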

a) Nonprofit Dataset

  • Dimensions
  • Nonprofit attributes (taxonomy)
  • Geographic data: results of geocode_type and trends in geographic distribution
  • Demographic data
  • Voting data

Contents: nonprofit attributes, nonprofit locations, board member locations, and census data (variables, time periods). Each board member has an ID (one we created for our own records), gender, and title, but no name or address.

setwd(wd)
npo <- readRDS("Data/5_ZipandCity/NONPROFITS-2014-2019v5.rds")

Geocode results SUMMARY TABLE

# Frequency and share of each geocoding method (NA = not geocoded)
x <- table(npo$geocode_type, useNA = "ifany")
y <- prop.table(x)
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
# Append a TOTAL row, then format the shares as percentages
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 1)
summary$percent <- paste0(round(summary$percent*100,1)," %")
rownames(summary)[nrow(summary)] <- "TOTAL"
# Reorder rows by geocoding step; index 7 keeps the TOTAL row
summary <- summary[c(1,3,4,2,5,6,7),]
pander(summary)
          frequency   percent
census    182042      69.1 %
google    37794       14.4 %
zip1      33638       12.8 %
city      114         0 %
zip2      9557        3.6 %
NA.       127         0 %
TOTAL     263272      100 %

How many are PO Boxes (POB)?

x <- table(npo$pob)
y <- prop.table(x)
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 1)
summary$percent <- paste0(round(summary$percent*100,1)," %")
rownames(summary) <- c("Non-POB", "POB", "TOTAL")
pander(summary)

b) People Dataset

#exploring the latest PPL file
setwd(wd)
ppl <- readRDS("Data/5_ZipandCity/PEOPLE-2014-2019v5.rds")

Geocode results

# Frequency and share of each geocoding method (NA = not geocoded)
x <- table(ppl$geocode_type, useNA = "ifany")
y <- prop.table(x)
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
# Append a TOTAL row, then format the shares as percentages
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 1)
summary$percent <- paste0(round(summary$percent*100,1)," %")
rownames(summary)[nrow(summary)] <- "TOTAL"
# Reorder rows by geocoding step; index 7 keeps the TOTAL row
summary <- summary[c(1,3,4,2,5,6,7),]
pander(summary)
          frequency   percent
census    689983      72.9 %
google    149739      15.8 %
zip1      87597       9.3 %
city      617         0.1 %
zip2      15509       1.6 %
NA.       2648        0.3 %
TOTAL     946093      100 %

How many are PO Boxes (POB)?

x <- table(ppl$pob)
y <- prop.table(x)
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 1)
summary$percent <- paste0(round(summary$percent*100,1)," %")
rownames(summary) <- c("Non-POB", "POB", "TOTAL")
pander(summary)
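
The summary tables above all repeat the same steps (count, share, TOTAL row, percent formatting). A minimal sketch of a helper that wraps those steps is shown here; freq_table and its row_order argument are hypothetical names, not part of the original scripts, and the toy vector stands in for a column such as npo$geocode_type.

```r
# Hypothetical helper: frequency table with percent column and TOTAL row.
# row_order is an optional character vector of row names giving the display order.
freq_table <- function(x, row_order = NULL) {
  counts <- table(x, useNA = "ifany")
  labels <- names(counts)
  labels[is.na(labels)] <- "NA"          # label the missing-value row explicitly
  out <- data.frame(
    frequency = as.integer(counts),
    percent   = as.numeric(prop.table(counts)),
    row.names = labels
  )
  if (!is.null(row_order)) {
    out <- out[row_order, , drop = FALSE]
  }
  total <- data.frame(frequency = sum(counts), percent = 1, row.names = "TOTAL")
  out <- rbind(out, total)
  out$percent <- paste0(round(out$percent * 100, 1), " %")
  out
}

# Toy input standing in for a geocode_type column
toy <- c("census", "google", "census", NA)
res <- freq_table(toy, row_order = c("census", "google", "NA"))
print(res)
```

Selecting rows by name rather than by position means the TOTAL row cannot be dropped by accident when the display order changes. The result can be passed to pander() as in the blocks above.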

Which are the most repeated addresses?

NPOs

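# Count how many records share each input_address, most frequent first
input <- as.data.frame(table(npo$input_address))
input <- input[order(input$Freq, decreasing = TRUE),]
input$Var1 <- as.character(input$Var1)
rownames(input) <- NULL
# Number of characters in each address string
input$len <- nchar(input$Var1)

Length (in characters) of input_address values

plot(table(input$len))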

Frequency of repetition of addresses:

pander(table(input$Freq))
Table continues below
Times repeated   1        2      3     4    5    6    7    8   9   10   11   12   13   14   15
Addresses        251170   4813   347   94   29   28   12   5   5   6    2    4    2    1    1

Times repeated   20   23   31   34   42   83   125
Addresses        1    1    1    2    1    1    1

We can see that one address is repeated 125 times.

The most common address is C/O WEINBERG 90 STATE ST SUITE 815, ALBANY, NY, 12207. This is an attorney's office: https://weinbergpc.squarespace.com/

For PPL data

# Count how many records share each input_address, most frequent first
input <- as.data.frame(table(ppl$input_address))
input <- input[order(input$Freq, decreasing = TRUE),]
input$Var1 <- as.character(input$Var1)
rownames(input) <- NULL
# Number of characters in each address string
input$len <- nchar(input$Var1)

pander(table(input$Freq))
Table continues below
Times repeated   1        2       3       4       5       6     7     8     9     10    11   12
Addresses        606972   70823   25877   12255   12169   480   190   210   121   108   19   13

Table continues below
Times repeated   13   14   15   16   17   18   19   20   21   22   23   24   25   26   27   29
Addresses        11   9    7    3    4    6    1    4    1    2    1    3    1    1    1    3

Times repeated   31   32   33   34   44   50   89   98   122
Addresses        1    1    1    1    1    1    1    1    1

One address repeated 122 times: 1300 E SHAW AVE SUITE 149, FRESNO, CA, 93710
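
Heavily repeated addresses, such as the ones above, can be flagged for later review (for instance, to separate law offices or registered agents from home addresses). A minimal sketch on toy data follows; the threshold value and the shared_address column name are assumptions made for illustration, not choices taken from the project.

```r
# Toy stand-in for ppl$input_address; values are made up.
ppl_toy <- data.frame(
  input_address = c("A ST", "A ST", "A ST", "B AVE", "C RD", "C RD", NA),
  stringsAsFactors = FALSE
)

# Count records per address (table() drops NA by default)
counts <- table(ppl_toy$input_address)

# Hypothetical cut-off: addresses used by at least this many records
threshold <- 3
shared <- names(counts)[counts >= threshold]

# Flag every record whose address is in the shared set (NA is never flagged)
ppl_toy$shared_address <- ppl_toy$input_address %in% shared

print(ppl_toy)
```

The right threshold would need to be chosen by inspecting the frequency tables above.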


2. Geocoding Process

  • Document decisions
  • Provide examples
  • Identify outstanding issues
  • Challenges / Troubleshooting faced (examples)
  • How we deal with special cases (examples)

2.1 Geocoding strategy

What is our strategy for enhancing the dataset with census data?

Hierarchy of geocode results:

  1. Census and Google geocodes of complete addresses (we need to remove the addresses geocoded with NAs, especially from Google, which can geocode an address to a whole state)
  2. Zip code centroid
  3. City centroid
  4. State (we might have the state, but it is too broad for geographic analysis)
  5. Missing
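
The hierarchy above can be applied as a priority fill: each record takes its coordinates from the first source that has them. A minimal sketch, assuming the lat_/lon_ column suffixes seen in the dataset (cen, ggl, zip1) and toy coordinate values; the exact priority order used in the project, including where zip2 and city fall, is not confirmed here.

```r
# Toy data: column names follow the dataset, values are made up.
df <- data.frame(
  lat_cen  = c(40.1, NA, NA, NA),
  lon_cen  = c(-75.2, NA, NA, NA),
  lat_ggl  = c(40.1, 34.0, NA, NA),
  lon_ggl  = c(-75.2, -118.2, NA, NA),
  lat_zip1 = c(40.0, 34.1, 41.9, NA),
  lon_zip1 = c(-75.0, -118.0, -87.6, NA)
)

# Sources in assumed priority order; values are the column suffixes.
sources <- c(census = "cen", google = "ggl", zip1 = "zip1")

df$lat <- NA_real_
df$lon <- NA_real_
df$geocode_type <- NA_character_

for (type in names(sources)) {
  lat_col <- paste0("lat_", sources[[type]])
  lon_col <- paste0("lon_", sources[[type]])
  # Fill only rows that are still missing and that this source can provide
  fill <- is.na(df$lat) & !is.na(df[[lat_col]]) & !is.na(df[[lon_col]])
  df$lat[fill] <- df[[lat_col]][fill]
  df$lon[fill] <- df[[lon_col]][fill]
  df$geocode_type[fill] <- type
}

print(df[, c("lat", "lon", "geocode_type")])
```

Records that no source can geocode keep NA in all three columns, which corresponds to the "Missing" level of the hierarchy.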

Geocoding Strategy Flowchart [PENDING: Add a table describing the types of addresses and the geocoding plan for each]

Special cases: POBs with a complete address; addresses that seem fake.

Assumptions and choices: we also need to document the assumptions made in coding the data. For example, when the zip code is used because the address is missing: how many cases does that entail, does it bias the data at all, and in what direction?

Steps Table

Geocoding Steps Table

Geocoding Results Summary

Geocoding Results Nonprofits Geocoding Results People

2.2 Census Geocoding

What were the results of this geocode?

Exploring the results files: how many cases were geocoded vs. how many were passed through? What is the percentage of POBs among the No_Match or Tie results?

What do we learn from failed addresses? What causes a failure (No_Match or Tie) in the Census geocode?

Candidate causes: large office buildings or suite numbers (what else causes a tie?), PO Box only, incomplete address, bad street address but valid city and zip, no zip (only city), incorrect spelling. Any others?

Troubleshooting and things to consider when Geocoding through the Census service

  • Sometimes the Census service blocks requests or fails on an operation, so that one whole batch comes back as NULL.
  • IDs must be unique; otherwise the geocoding process ignores duplicates. This was a problem because some IDs in our data are repeated.

Provide one example of each type of problem and the solution we applied:

  • Tie in Census: what else causes a tie?
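
Both issues above can be handled before and during submission: geocode each distinct input_address once under a freshly generated unique ID, and retry a batch when it comes back NULL. The sketch below runs on toy data; addr_id, batch_size, geocode_with_retry and the stub geocode_batch are hypothetical names, and the stub only mimics a service call (it does not contact the Census geocoder).

```r
# Toy records with repeated IDs and a repeated address; values are illustrative.
npo_toy <- data.frame(
  ID = c(1, 1, 2, 3, 4),
  input_address = c("90 STATE ST SUITE 815, ALBANY, NY, 12207",
                    "90 STATE ST SUITE 815, ALBANY, NY, 12207",
                    "1300 E SHAW AVE SUITE 149, FRESNO, CA, 93710",
                    "1 MAIN ST, SPRINGFIELD, IL, 62701",
                    NA),
  stringsAsFactors = FALSE
)

# One row per distinct, non-missing address, with a unique ID for the geocoder
addresses <- unique(npo_toy$input_address[!is.na(npo_toy$input_address)])
lookup <- data.frame(addr_id = seq_along(addresses),
                     input_address = addresses,
                     stringsAsFactors = FALSE)

# Split into batches (batch_size is a placeholder; check the service's limit)
batch_size <- 2
batches <- split(lookup, ceiling(seq_len(nrow(lookup)) / batch_size))

# Stub standing in for the service call: returns NULL on the first call
# to mimic a failed batch, then a "geocoded" data frame afterwards.
calls <- 0
geocode_batch <- function(batch) {
  calls <<- calls + 1
  if (calls == 1) return(NULL)
  cbind(batch, match = "Match")
}

# Retry a batch a few times when the service returns NULL
geocode_with_retry <- function(batch, max_tries = 3) {
  for (i in seq_len(max_tries)) {
    res <- geocode_batch(batch)
    if (!is.null(res)) return(res)
  }
  NULL
}

results <- do.call(rbind, unname(lapply(batches, geocode_with_retry)))

# Attach the results back to every record through input_address
npo_keyed <- merge(npo_toy, results, by = "input_address", all.x = TRUE)
print(npo_keyed)
```

Records with a missing address stay in the merged table with NA results, so they can be routed to the zip code or city centroid steps.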
  • Large office building? Suite #s?
  • PO box only
  • Incomplete address
  • Bad street address, but city and zip
  • No zip, only city
  • Incorrect spelling

Some answers are general and some are specific to the Census geocode (and others will be specific to Google). PENDING: place each in its corresponding section.

2.3 Google Geocoding

2.4 Zip and City centroid Geocoding

3. Data gaps, Consistency tests and disambiguation

What are the data gaps and limitations of the dataset?

Comparing Census and Google results

Testing accuracy of address geocodes

  • Testing geocodes yielded by each method against manual geocoding (sample ~100)
  • Testing POBs through different geocoding services
      ◦ POBs that only have a box number → zip code or city center.
      ◦ POBs that also have addresses → test whether geocoding services can capture the address (maybe after removing the POB with a regular expression?); a sample could be tested against manual geocoding.
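
One way to score such tests is the distance between two geocodes of the same address, for example Census vs. Google, or either method vs. a manual geocode. The sketch below uses the haversine great-circle formula; haversine_km is a hypothetical helper name, the Earth radius of 6371 km is the usual mean-radius approximation, and the coordinates are made up.

```r
# Great-circle distance in kilometres between two sets of coordinates
# (vectorised; inputs in decimal degrees).
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Toy comparison; column names follow the dataset, values are made up.
geo <- data.frame(
  lat_cen = c(42.65, 36.81), lon_cen = c(-73.75, -119.78),
  lat_ggl = c(42.65, 36.82), lon_ggl = c(-73.75, -119.78)
)
geo$dist_km <- haversine_km(geo$lat_cen, geo$lon_cen, geo$lat_ggl, geo$lon_ggl)
print(geo)
```

Summarising dist_km over a sample (median, share above some tolerance) would give a simple agreement measure between methods.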

Testing accuracy of our approach to geocoding zip codes and city centers

  • Testing the validity of using zip code or city centroids for census data.
      ◦ Is there a significant difference between demographic data captured through the address (census tract) vs. the zip code or city centroid?

Validity checks of the data

  • How can we identify whether the addresses in the BMS dataset are business or home addresses?
      ◦ Comparing them against a list of business establishment addresses
      ◦ Identifying the most frequently repeated addresses

Treatment of POBs and failed addresses

Some POBs also have street addresses: can we pass them through Google? Select the ones with the highest character count.

What types of failed addresses did we find?

  • The geocoding service is not smart.
  • Many are plain numbers. These could be treated as missing data, since a number alone is not enough to geocode.
  • Others are single letters, which are probably missing address data too.
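
These cases can be flagged with regular expressions before geocoding. The sketch below is only illustrative: the PO Box pattern is a hypothetical one that covers common spellings ("PO BOX 123", "P.O. Box 45"), not the rule used to build the pob variable, and the sample addresses are made up.

```r
# Made-up sample addresses
addr <- c("PO BOX 123",
          "P.O. Box 45 100 MAIN ST",
          "12345",
          "A",
          "90 STATE ST SUITE 815")

# Hypothetical PO Box pattern: P, optional dot, O, optional dot, BOX, number
pob_pattern <- "\\bP\\.?\\s*O\\.?\\s*BOX\\s*\\d+\\b"
is_pob <- grepl(pob_pattern, addr, ignore.case = TRUE, perl = TRUE)

# Remove the PO Box part; what remains may be a usable street address
stripped <- trimws(gsub(pob_pattern, "", addr, ignore.case = TRUE, perl = TRUE))
pob_with_street <- is_pob & nchar(stripped) > 0

# Plain numbers or single letters: too little information to geocode
is_junk <- grepl("^[0-9]+$|^[A-Za-z]$", addr)

print(data.frame(addr, is_pob, pob_with_street, is_junk, stripped))
```

POBs with a remaining street address could then be sent to the geocoder using the stripped text, while junk entries could be recoded as missing.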

Accuracy of Census vs Google addresses

Accuracy of Zip and City centroid geocoding results

Running some POBs through the Google geocode process

Consistency: testing accuracy of the different methods of geocoding data, testing our approach to geocoding POBs + testing accuracy of census data

4. Further questions and potential enhancements

  1. What are the main assumptions and/or limitations of the dataset (biases)?
  • Business address vs. residential
  • Is the nonprofit address related to the organization's target population, to its member composition, or to neither?
  2. What questions could arise regarding the dataset generation process, accuracy/reliability, representativeness, etc.?
  3. What additional enhancements could be made to the dataset?
  4. What sort of research questions could it help answer?