In this Report we will cover:
Loading packages and files
Sources, key variables, dimensions, etc. Nonprofit Dataset
For more information on the variables see the Data Dictionary [PENDING]
Dataset development/augmentation process
Dataset Development Process
1023-EZ data has been used to generate two datasets:
The datasets have been manipulated to remove any sensitive information and to include the following enhancements:
Main Characteristics
To make geocoding more efficient, we used input_address as the key.
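A minimal sketch of how keying on input_address can reduce geocoding work: each distinct address is geocoded once, and the result is merged back onto every record that shares it. The toy records and the `fake_geocode()` stand-in are hypothetical; the real pipeline would call a geocoding service at that step.

```r
# Toy records: three of the four rows share one address (made-up data)
records <- data.frame(
  ID = 1:4,
  input_address = c("A ST", "B AVE", "A ST", "A ST"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for a geocoding call; returns dummy coordinates
fake_geocode <- function(address) {
  data.frame(lat = nchar(address), lon = -nchar(address))
}

# Geocode each distinct address only once
unique_addresses <- unique(records$input_address)
coords <- do.call(rbind, lapply(unique_addresses, fake_geocode))
geocoded <- cbind(
  data.frame(input_address = unique_addresses, stringsAsFactors = FALSE),
  coords
)

# Merge the results back onto every record, using input_address as the key
result <- merge(records, geocoded, by = "input_address", all.x = TRUE)
result <- result[order(result$ID), ]
```

Here only two geocoding calls are made for four records; the saving grows with the number of repeated addresses.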
The Nonprofit datasets (Nonprofit and Board Members) have been generated based on IRS 1023-EZ forms for the years 2014 to 2019, which are publicly available here.
Nonprofits submit 1023-EZ forms to apply for recognition as a tax-exempt organization under Section 501(c)(3) of the Internal Revenue Code; for more details visit the IRS website. Only organizations that meet certain characteristics are eligible. Some of the main requirements are:
For more detail see p. 13 of the 1023-EZ application instructions
According to the 1023-EZ instructions, nonprofits must submit detailed organization information, including the organization’s name, mailing address, employer identification number (EIN), mission, contact person, and the name, title, and mailing address of up to five organization officers, directors, and/or trustees. The 1023-EZ instructions prioritize reporting of organization members in the following order:
If an individual serves in more than one office (for example, as both an officer and director), list this individual on only one line and list all offices held.
An officer is a person elected or appointed to manage the organization’s daily operations, such as president, vice president, secretary, treasurer, and, in some cases, board chair. The officers of an organization are determined by reference to its organizing document, bylaws, or resolutions of its governing body, or otherwise designated consistent with state law. A director or trustee is a member of the organization’s governing body, but only if the member has voting rights.
Do the e-filers represent all or most of 1023-EZ filers? What proportion of 1023 filers are EZ filers? What proportion of small nonprofits are recent 1023-EZ filers? Are EZ filers representative of all 1023 filers? If not, what is particular about this sample?
[Pending]
Loading the data
npo1 <- readRDS("Data/3_GeoCensus/NONPROFITS-2014-2019v3.rds")
npo2 <- readRDS("Data/4_GeoGoogle/NONPROFITS-2014-2019v4.rds")
npo3 <- readRDS("Data/5_ZipandCity/NONPROFITS-2014-2019v5.rds")
names(npo1)
names(npo2)
names(npo3)
## [1] "ID" "key"
## [3] "ORGNAME" "EIN"
## [5] "YR" "Mission"
## [7] "Orgname1" "Orgname2"
## [9] "Case.Number" "Formrevision"
## [11] "Eligibilityworksheet" "Address"
## [13] "City" "State"
## [15] "Zip" "Zippl4"
## [17] "Accountingperiodend" "Userfeesubmitted"
## [19] "Orgurl" "Orgtypecorp"
## [21] "Orgtypeunincorp" "Orgtypetrust"
## [23] "Necessaryorgdocs" "Incorporateddate"
## [25] "Incorporatedstate" "Containslimitation"
## [27] "Doesnotexpresslyempower" "Containsdissolution"
## [29] "Nteecode" "Orgpurposecharitable"
## [31] "Orgpurposereligious" "Orgpurposeeducational"
## [33] "Orgpurposescientific" "Orgpurposeliterary"
## [35] "Orgpurposepublicsafety" "Orgpurposeamateursports"
## [37] "Orgpurposecrueltyprevention" "Qualifyforexemption"
## [39] "Leginflno" "Leginflyes"
## [41] "Compofcrdirtrustno" "Compofcrdirtrustyes"
## [43] "Donatefundsno" "Donatefundsyes"
## [45] "Conductactyoutsideusno" "Conductactyoutsideusyes"
## [47] "Financialtransofcrsno" "Financialtransofcrsyes"
## [49] "Unrelgrossincm1000moreno" "Unrelgrossincm1000moreyes"
## [51] "Gamingactyno" "Gamingactyyes"
## [53] "Disasterreliefno" "Disasterreliefyes"
## [55] "Onethirdsupportpublic" "Onethirdsupportgifts"
## [57] "Benefitofcollege" "Privatefoundation508e"
## [59] "Seekingretroreinstatement" "Seekingsec7reinstatement"
## [61] "Gamingactyno.1" "Gamingactyyes.1"
## [63] "HospitalOrChurchNo" "HospitalOrChurchYes"
## [65] "Correctnessdeclaration" "Signaturename"
## [67] "Signaturetitle" "Signaturedate"
## [69] "EZVersionNumber" "IDdup"
## [71] "pob" "add.len"
## [73] "input_address" "match"
## [75] "match_type" "out_address"
## [77] "lat_lon_cen" "tiger_line_id"
## [79] "tiger_line_side" "state_fips"
## [81] "county_fips" "tract_fips"
## [83] "block_fips" "lon_cen"
## [85] "lat_cen" "geocode_type"
## [1] "ID" "key"
## [3] "ORGNAME" "EIN"
## [5] "YR" "Mission"
## [7] "Orgname1" "Orgname2"
## [9] "Case.Number" "Formrevision"
## [11] "Eligibilityworksheet" "Address"
## [13] "City" "State"
## [15] "Zip" "Zippl4"
## [17] "Accountingperiodend" "Userfeesubmitted"
## [19] "Orgurl" "Orgtypecorp"
## [21] "Orgtypeunincorp" "Orgtypetrust"
## [23] "Necessaryorgdocs" "Incorporateddate"
## [25] "Incorporatedstate" "Containslimitation"
## [27] "Doesnotexpresslyempower" "Containsdissolution"
## [29] "Nteecode" "Orgpurposecharitable"
## [31] "Orgpurposereligious" "Orgpurposeeducational"
## [33] "Orgpurposescientific" "Orgpurposeliterary"
## [35] "Orgpurposepublicsafety" "Orgpurposeamateursports"
## [37] "Orgpurposecrueltyprevention" "Qualifyforexemption"
## [39] "Leginflno" "Leginflyes"
## [41] "Compofcrdirtrustno" "Compofcrdirtrustyes"
## [43] "Donatefundsno" "Donatefundsyes"
## [45] "Conductactyoutsideusno" "Conductactyoutsideusyes"
## [47] "Financialtransofcrsno" "Financialtransofcrsyes"
## [49] "Unrelgrossincm1000moreno" "Unrelgrossincm1000moreyes"
## [51] "Gamingactyno" "Gamingactyyes"
## [53] "Disasterreliefno" "Disasterreliefyes"
## [55] "Onethirdsupportpublic" "Onethirdsupportgifts"
## [57] "Benefitofcollege" "Privatefoundation508e"
## [59] "Seekingretroreinstatement" "Seekingsec7reinstatement"
## [61] "Gamingactyno.1" "Gamingactyyes.1"
## [63] "HospitalOrChurchNo" "HospitalOrChurchYes"
## [65] "Correctnessdeclaration" "Signaturename"
## [67] "Signaturetitle" "Signaturedate"
## [69] "EZVersionNumber" "IDdup"
## [71] "pob" "add.len"
## [73] "input_address" "match"
## [75] "match_type" "out_address"
## [77] "lat_lon_cen" "tiger_line_id"
## [79] "tiger_line_side" "state_fips"
## [81] "county_fips" "tract_fips"
## [83] "block_fips" "lon_cen"
## [85] "lat_cen" "geocode_type"
## [87] "lon_ggl" "lat_ggl"
## [89] "address_ggl"
## [1] "ID" "key"
## [3] "ORGNAME" "EIN"
## [5] "YR" "Mission"
## [7] "Orgname1" "Orgname2"
## [9] "Case.Number" "Formrevision"
## [11] "Eligibilityworksheet" "Address"
## [13] "City" "State"
## [15] "Zip" "Zippl4"
## [17] "Accountingperiodend" "Userfeesubmitted"
## [19] "Orgurl" "Orgtypecorp"
## [21] "Orgtypeunincorp" "Orgtypetrust"
## [23] "Necessaryorgdocs" "Incorporateddate"
## [25] "Incorporatedstate" "Containslimitation"
## [27] "Doesnotexpresslyempower" "Containsdissolution"
## [29] "Nteecode" "Orgpurposecharitable"
## [31] "Orgpurposereligious" "Orgpurposeeducational"
## [33] "Orgpurposescientific" "Orgpurposeliterary"
## [35] "Orgpurposepublicsafety" "Orgpurposeamateursports"
## [37] "Orgpurposecrueltyprevention" "Qualifyforexemption"
## [39] "Leginflno" "Leginflyes"
## [41] "Compofcrdirtrustno" "Compofcrdirtrustyes"
## [43] "Donatefundsno" "Donatefundsyes"
## [45] "Conductactyoutsideusno" "Conductactyoutsideusyes"
## [47] "Financialtransofcrsno" "Financialtransofcrsyes"
## [49] "Unrelgrossincm1000moreno" "Unrelgrossincm1000moreyes"
## [51] "Gamingactyno" "Gamingactyyes"
## [53] "Disasterreliefno" "Disasterreliefyes"
## [55] "Onethirdsupportpublic" "Onethirdsupportgifts"
## [57] "Benefitofcollege" "Privatefoundation508e"
## [59] "Seekingretroreinstatement" "Seekingsec7reinstatement"
## [61] "Gamingactyno.1" "Gamingactyyes.1"
## [63] "HospitalOrChurchNo" "HospitalOrChurchYes"
## [65] "Correctnessdeclaration" "Signaturename"
## [67] "Signaturetitle" "Signaturedate"
## [69] "EZVersionNumber" "IDdup"
## [71] "pob" "add.len"
## [73] "input_address" "match"
## [75] "match_type" "out_address"
## [77] "lat_lon_cen" "tiger_line_id"
## [79] "tiger_line_side" "state_fips"
## [81] "county_fips" "tract_fips"
## [83] "block_fips" "lon_cen"
## [85] "lat_cen" "geocode_type"
## [87] "lon_ggl" "lat_ggl"
## [89] "address_ggl" "lat_zip1"
## [91] "lon_zip1" "City_zip2"
## [93] "State_zip2" "lat_zip2"
## [95] "lon_zip2" "city_st"
## [97] "lat_cty" "lon_cty"
## [99] "lat" "lon"
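To see at a glance which variables each geocoding stage adds, the column names of consecutive versions can be compared. The sketch below uses two small made-up data frames; `added_columns()` is a hypothetical helper, not part of the original workflow.

```r
# Return the columns present in `current` but not in `previous`
added_columns <- function(previous, current) {
  setdiff(names(current), names(previous))
}

# Toy stand-ins for two consecutive dataset versions (made-up data)
stage_a <- data.frame(ID = 1:2, input_address = c("A ST", "B AVE"), lat_cen = c(1, NA))
stage_b <- cbind(stage_a, lat_ggl = c(NA, 2), lon_ggl = c(NA, 3))

new_cols <- added_columns(stage_a, stage_b)
print(new_cols)
```

Based on the listings above, `added_columns(npo1, npo2)` should return the three Google columns (lon_ggl, lat_ggl, address_ggl).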
ppl1 <- readRDS("Data/3_GeoCensus/PEOPLE-2014-2019v3.rds")
ppl2 <- readRDS("Data/4_GeoGoogle/PEOPLE-2014-2019v4.rds")
ppl3 <- readRDS("Data/5_ZipandCity/PEOPLE-2014-2019v5.rds")
names(ppl1)
names(ppl2)
names(ppl3)
## [1] "ID" "key" "ORGNAME" "EIN"
## [5] "YR" "Signaturedate" "Case.Number" "Firstname"
## [9] "Lastname" "Title" "Address" "City"
## [13] "State" "Zip" "Zippl4" "gender"
## [17] "proportion_male" "IDdup" "pob" "add.len"
## [21] "input_address" "match" "match_type" "out_address"
## [25] "lat_lon_cen" "tiger_line_id" "tiger_line_side" "state_fips"
## [29] "county_fips" "tract_fips" "block_fips" "lon_cen"
## [33] "lat_cen" "geocode_type"
## [1] "ID" "key" "ORGNAME" "EIN"
## [5] "YR" "Signaturedate" "Case.Number" "Firstname"
## [9] "Lastname" "Title" "Address" "City"
## [13] "State" "Zip" "Zippl4" "gender"
## [17] "proportion_male" "IDdup" "pob" "add.len"
## [21] "input_address" "match" "match_type" "out_address"
## [25] "lat_lon_cen" "tiger_line_id" "tiger_line_side" "state_fips"
## [29] "county_fips" "tract_fips" "block_fips" "lon_cen"
## [33] "lat_cen" "geocode_type" "lon_ggl" "lat_ggl"
## [37] "address_ggl"
## [1] "ID" "key" "ORGNAME" "EIN"
## [5] "YR" "Signaturedate" "Case.Number" "Firstname"
## [9] "Lastname" "Title" "Address" "City"
## [13] "State" "Zip" "Zippl4" "gender"
## [17] "proportion_male" "IDdup" "pob" "add.len"
## [21] "input_address" "match" "match_type" "out_address"
## [25] "lat_lon_cen" "tiger_line_id" "tiger_line_side" "state_fips"
## [29] "county_fips" "tract_fips" "block_fips" "lon_cen"
## [33] "lat_cen" "geocode_type" "lon_ggl" "lat_ggl"
## [37] "address_ggl" "lat_zip1" "lon_zip1" "City_zip2"
## [41] "State_zip2" "lat_zip2" "lon_zip2" "city_st"
## [45] "lat_cty" "lon_cty" "lat" "lon"
Nonprofit attributes, nonprofit locations, board member locations, census data (variables, time periods).
Each board member has an ID (one we create for our own records), gender, and title, but no name or address.
Geocode results SUMMARY TABLE
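# Assumption: `npo` is the final nonprofit dataset (npo3); the report never defines it explicitly
npo <- npo3
# Count records by geocode method, keeping NAs, and convert counts to proportions
x <- table(npo$geocode_type, useNA = "ifany")
y <- prop.table(x)
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
# Append a TOTAL row, then format proportions as percentages
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 1)
summary$percent <- paste0(round(summary$percent*100,1)," %")
rownames(summary)[nrow(summary)] <- "TOTAL"
# Reorder rows for display; row 7 (TOTAL) is not selected, so it is absent from the printed table
summary <- summary[c(1,3,4,2,5,6),]
pander(summary)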
geocode_type | frequency | percent |
---|---|---|
census | 182042 | 69.1 % |
google | 37794 | 14.4 % |
zip1 | 33638 | 12.8 % |
city | 114 | 0 % |
zip2 | 9557 | 3.6 % |
NA. | 127 | 0 % |
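The same frequency-and-percent tabulation is repeated for several variables in this report. A small helper could reduce the duplication; the sketch below is one possible version, where `freq_summary()` is a hypothetical name and the input vector is made-up toy data.

```r
# Build a frequency/percent table with a TOTAL row for any vector
freq_summary <- function(v, use_na = "ifany") {
  counts <- table(v, useNA = use_na)
  labels <- names(counts)
  labels[is.na(labels)] <- "NA"   # give the NA category a printable row name
  out <- data.frame(
    frequency = as.integer(counts),
    percent   = as.numeric(prop.table(counts)),
    row.names = labels
  )
  total <- data.frame(frequency = sum(out$frequency), percent = 1, row.names = "TOTAL")
  out <- rbind(out, total)
  # Format proportions as percentages with one decimal
  out$percent <- paste0(round(out$percent * 100, 1), " %")
  out
}

# Toy example (made-up values, not the real data)
toy_types <- c("census", "census", "zip1", NA)
toy_summary <- freq_summary(toy_types)
print(toy_summary)
```

With the real data this could be called as `freq_summary(npo$geocode_type)`, and the result passed to `pander()`.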
How many addresses are PO boxes (POB)?
x <- table(npo$pob)
y <- prop.table(x)
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 1)
summary$percent <- paste0(round(summary$percent*100,1)," %")
rownames(summary) <- c("Non-POB", "POB", "TOTAL")
pander(summary)
Geocode results
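# Assumption: `ppl` is the final people dataset (ppl3); the report never defines it explicitly
ppl <- ppl3
# Count records by geocode method, keeping NAs, and convert counts to proportions
x <- table(ppl$geocode_type, useNA = "ifany")
y <- prop.table(x)
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
# Append a TOTAL row, then format proportions as percentages
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 1)
summary$percent <- paste0(round(summary$percent*100,1)," %")
rownames(summary)[nrow(summary)] <- "TOTAL"
# Reorder rows for display; row 7 (TOTAL) is not selected, so it is absent from the printed table
summary <- summary[c(1,3,4,2,5,6),]
pander(summary)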
geocode_type | frequency | percent |
---|---|---|
census | 689983 | 72.9 % |
google | 149739 | 15.8 % |
zip1 | 87597 | 9.3 % |
city | 617 | 0.1 % |
zip2 | 15509 | 1.6 % |
NA. | 2648 | 0.3 % |
How many addresses are PO boxes (POB)?
x <- table(ppl$pob)
y <- prop.table(x)
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 1)
summary$percent <- paste0(round(summary$percent*100,1)," %")
rownames(summary) <- c("Non-POB", "POB", "TOTAL")
pander(summary)
Which are the most repeated addresses?
NPOs
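# Count how often each input address appears, most frequent first
input <- as.data.frame(table(npo$input_address))
input <- input[order(input$Freq, decreasing = TRUE),]
input$Var1 <- as.character(input$Var1)
rownames(input) <- NULL
# Length of each address string, in characters
input$len <- nchar(input$Var1)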
Size of input_addresses
Frequency of repetition of addresses:
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
251170 | 4813 | 347 | 94 | 29 | 28 | 12 | 5 | 5 | 6 | 2 | 4 | 2 | 1 | 1 |
20 | 23 | 31 | 34 | 42 | 83 | 125 |
---|---|---|---|---|---|---|
1 | 1 | 1 | 2 | 1 | 1 | 1 |
We can see that there is one address that is repeated 125 times.
The most common address is C/O WEINBERG 90 STATE ST SUITE 815, ALBANY, NY, 12207. This is an attorney's office building: https://weinbergpc.squarespace.com/
For PPL data
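# Count how often each input address appears, most frequent first
input <- as.data.frame(table(ppl$input_address))
input <- input[order(input$Freq, decreasing = TRUE),]
input$Var1 <- as.character(input$Var1)
rownames(input) <- NULL
# Length of each address string, in characters
input$len <- nchar(input$Var1)
# Distribution of how many times addresses are repeated
pander(table(input$Freq))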
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
---|---|---|---|---|---|---|---|---|---|---|---|
606972 | 70823 | 25877 | 12255 | 12169 | 480 | 190 | 210 | 121 | 108 | 19 | 13 |
13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 29 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
11 | 9 | 7 | 3 | 4 | 6 | 1 | 4 | 1 | 2 | 1 | 3 | 1 | 1 | 1 | 3 |
31 | 32 | 33 | 34 | 44 | 50 | 89 | 98 | 122 |
---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
One address repeated 122 times: 1300 E SHAW AVE SUITE 149, FRESNO, CA, 93710
What is our strategy for enhancing the dataset with census data?
Hierarchy of geocode results:
1. Census and Google geocodes of complete addresses (we need to remove the addresses geocoded with NAs, especially from Google, which can geocode to a state)
2. Zip code centroid
3. City centroid
4. State (we might have the state, but it is too broad for geographic analysis)
5. Missing
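The hierarchy above can be sketched as a fallback rule: for each record, take the coordinates from the first source in the priority list that has them, and record which source was used. This is a sketch under assumptions, not the code that produced the datasets: `pick_coordinates()` is a hypothetical helper, the column suffixes follow the variable names listed earlier (lat_cen, lat_ggl, lat_zip1, lat_cty, lat_zip2), and the default priority order mirrors the row order of the geocode summary tables.

```r
# Fill lat/lon from the first available source, in priority order
pick_coordinates <- function(df,
                             sources = c(census = "cen", google = "ggl",
                                         zip1 = "zip1", city = "cty", zip2 = "zip2")) {
  df$lat <- NA_real_
  df$lon <- NA_real_
  df$geocode_type <- NA_character_
  for (type in names(sources)) {
    lat_col <- paste0("lat_", sources[[type]])
    lon_col <- paste0("lon_", sources[[type]])
    # Only fill records that are still empty and have both coordinates in this source
    fill <- is.na(df$lat) & !is.na(df[[lat_col]]) & !is.na(df[[lon_col]])
    df$lat[fill] <- df[[lat_col]][fill]
    df$lon[fill] <- df[[lon_col]][fill]
    df$geocode_type[fill] <- type
  }
  df
}

# Toy records (made-up coordinates): full match, zip-only match, no match
toy <- data.frame(
  lat_cen  = c(42.65, NA, NA),     lon_cen  = c(-73.75, NA, NA),
  lat_ggl  = c(42.66, NA, NA),     lon_ggl  = c(-73.76, NA, NA),
  lat_zip1 = c(42.60, 36.80, NA),  lon_zip1 = c(-73.70, -119.80, NA),
  lat_cty  = c(NA_real_, NA, NA),  lon_cty  = c(NA_real_, NA, NA),
  lat_zip2 = c(NA_real_, NA, NA),  lon_zip2 = c(NA_real_, NA, NA)
)
picked <- pick_coordinates(toy)
print(picked[, c("lat", "lon", "geocode_type")])
```

Records with no usable source keep NA coordinates and an NA geocode_type, which corresponds to the "Missing" level of the hierarchy.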
[PENDING: Add a table describing the types of addresses and the geocoding plan for each]
Special cases
POBs with complete address
Addresses that seem fake
Assumptions and choices
Which assumptions are made in coding the data? For example, when the zip code is used because the address is missing: how many cases does that entail, does it bias the data at all, and in what direction?
Exploring the results files: how many cases were geocoded vs. how many were passed through? What is the % of POBs in the No_Match or Tie categories?
Large office building? Suite #s? What else causes a tie?
PO box only
Incomplete address
Bad street address, but city and zip
No zip, only city
Incorrect spelling
Any others?
Providing one example of each type of problem and the solution we applied:
Some answers are general and some are specific to the Census geocode (and others will be specific to the Google geocode). PENDING: place each in its corresponding section.
What are the data gaps and limitations of the dataset?
Testing accuracy of address geocodes
• Testing geocodes yielded by each method against manual geocoding (sample ~100)
• Testing POBs through different geocoding services
o POBs that only have a box number: use the zip code or city center.
o POBs that also have addresses: test if geocoding services can capture the address (maybe removing the POB with a regular expression?); a sample could be tested against manual geocoding.
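One way to quantify the agreement between a geocoding method and a manual geocode is the great-circle distance between the two coordinate pairs. The sketch below implements the haversine formula; `haversine_km()` is a hypothetical helper and the mean Earth radius of 6371 km is a common approximation, not a value taken from this report.

```r
# Great-circle distance in kilometers between two points given in decimal degrees
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  # pmin guards against rounding slightly above 1 before asin
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Toy check: one degree of latitude along the same meridian
one_degree <- haversine_km(0, 0, 1, 0)
print(one_degree)
```

The function is vectorized, so it could be applied to whole columns, for example comparing lat_cen/lon_cen against manually geocoded coordinates for the test sample.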
Testing accuracy of our approach to geocoding zip codes and city centers
• Testing the validity of using zip code or city centroids for census data.
o Is there a significant difference between demographic data captured through the address (census tract) vs. the zip code or city centroid?
Validity checks of the data
• How can we identify whether the addresses in the BMS dataset are business or home addresses?
o Comparing them against a list of business establishment addresses
o Identifying the most frequently repeated addresses
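Repeated addresses can also be flagged at the record level, which may help separate likely business or agent addresses from home addresses. In this sketch `flag_shared_addresses()` is a hypothetical helper and the threshold of 5 is an arbitrary illustration, not a value chosen in the analysis.

```r
# TRUE for every record whose address appears at least `min_count` times
flag_shared_addresses <- function(addresses, min_count = 5) {
  addresses <- as.character(addresses)
  counts <- table(addresses)
  # Look up each record's address count by name
  as.integer(counts[addresses]) >= min_count
}

# Toy example (made-up addresses)
toy_addresses <- c("A ST", "A ST", "B AVE", "A ST")
shared <- flag_shared_addresses(toy_addresses, min_count = 3)
print(shared)
```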
Some POBs also include street addresses. Can we pass them through Google? Select the ones with the highest character count.
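A minimal sketch of this idea: detect the PO box portion with a regular expression, strip it, and rank what remains by length so the longest candidates can be sent to the geocoder first. The pattern is an assumption about how PO boxes are written (it only covers numbered boxes such as "PO BOX 123", "P.O. BOX 7", or "POB 12"), the helper names are hypothetical, and the addresses are made up.

```r
# Matches "PO BOX 123", "P.O. BOX 7", "POB 12" (numbered boxes only), case-insensitive
pob_pattern <- "\\b(P\\.?\\s?O\\.?\\s?BOX|POB)\\s*#?\\s*[0-9][0-9A-Z-]*,?\\s*"

has_pob <- function(x) grepl(pob_pattern, x, ignore.case = TRUE, perl = TRUE)

strip_pob <- function(x) {
  trimws(gsub(pob_pattern, "", x, ignore.case = TRUE, perl = TRUE))
}

# Toy addresses (made-up)
toy <- c("PO BOX 123 45 MAIN ST, SPRINGFIELD",
         "P.O. BOX 7",
         "90 ELM ST SUITE 2")

stripped <- strip_pob(toy[has_pob(toy)])
# Longest remaining strings first: the most likely to contain a usable street address
candidates <- stripped[order(nchar(stripped), decreasing = TRUE)]
print(candidates)
```

Entries that are empty after stripping contained only a box number and would fall back to the zip code or city centroid.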
What types of failed addresses did we find?
* The geocoding service is not smart.
* Many are plain numbers. These could be treated as missing data, since a number alone is not enough to geocode.
* Others are single letters, which are probably missing address data too.
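The failure types described above can be flagged with simple pattern checks before deciding to treat them as missing. This is a sketch with made-up inputs; `classify_address()` and its category labels are hypothetical.

```r
# Label address strings that cannot be geocoded on their own
classify_address <- function(x) {
  x <- trimws(x)
  ifelse(is.na(x) | x == "", "missing",
    ifelse(grepl("^[0-9]+$", x), "number_only",
      ifelse(grepl("^[A-Za-z]$", x), "single_letter", "other")))
}

# Toy inputs (made-up)
toy <- c("12345", "B", "90 ELM ST", NA, " ")
labels <- classify_address(toy)
print(table(labels, useNA = "ifany"))
```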
Running some POBs through the Google geocode process
Consistency: testing accuracy of the different methods of geocoding data, testing our approach to geocoding POBs + testing accuracy of census data