Introduction

In this script we will augment the dataset with selected census variables using the geolocations.

We will use multiple methods to include census data…

We are using the IPUMS GeoMarker tool, which attaches contextual data to your point data by determining the census geographic unit in which each point lies and attaching characteristics of that unit to the point record. The initial release of GeoMarker attaches data from the 2017 American Community Survey 5-year data at the census tract level.

Steven Manson, Jonathan Schroeder, David Van Riper, and Steven Ruggles. IPUMS National Historical Geographic Information System: Version 14.0 [Database]. Minneapolis, MN: IPUMS. 2019. http://doi.org/10.18128/D050.V14.0

NOTE: Using this data requires a request; see https://nhgis.org/research/citation
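To make the idea of "determining the census geographic unit in which each point lies" concrete, here is a minimal base R sketch of a point-in-polygon lookup followed by an attribute attach. It is only an illustration under stated assumptions: the two square "tracts", their `medinc` values, and the helper names `point_in_polygon` and `lookup_tract` are hypothetical, and nothing here describes how GeoMarker is actually implemented.

```r
# Ray-casting test: TRUE when the point (x, y) lies inside the polygon
# whose vertices are px, py (listed in order, first vertex not repeated).
point_in_polygon <- function(x, y, px, py) {
  n <- length(px)
  inside <- FALSE
  j <- n
  for (i in seq_len(n)) {
    # Does edge (j -> i) straddle the horizontal line through y?
    if ((py[i] > y) != (py[j] > y)) {
      x_cross <- px[i] + (y - py[i]) * (px[j] - px[i]) / (py[j] - py[i])
      if (x < x_cross) inside <- !inside
    }
    j <- i
  }
  inside
}

# Hypothetical toy "tracts" (unit squares) with made-up attribute values.
tracts <- list(
  A = list(px = c(0, 1, 1, 0), py = c(0, 0, 1, 1), medinc = 50000),
  B = list(px = c(1, 2, 2, 1), py = c(0, 0, 1, 1), medinc = 70000)
)

# Hypothetical point records in the same toy coordinate system.
pts <- data.frame(key = 1:2, lon = c(0.25, 1.5), lat = c(0.5, 0.5))

# Return the name of the first tract containing the point (NA if none).
lookup_tract <- function(lon, lat) {
  hit <- names(tracts)[vapply(
    tracts,
    function(tr) point_in_polygon(lon, lat, tr$px, tr$py),
    logical(1)
  )]
  hit[1]
}

# Attach the tract id, then the tract's attribute, to each point record.
pts$tract  <- mapply(lookup_tract, pts$lon, pts$lat)
pts$medinc <- vapply(pts$tract, function(id) tracts[[id]]$medinc,
                     numeric(1), USE.NAMES = FALSE)
pts
```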

STEPS

  1. Getting Census Data for NPO geocodes.
  2. Getting Census Data for PPL geocodes.

Input files:

  • NONPROFITS-2014-2019v5.rds
  • PEOPLE-2014-2019v5.rds

Output Files:

  • NONPROFITS-2014-2019v6.rds
  • PEOPLE-2014-2019v6.rds

NOTES

PACKAGES

1. Getting Nonprofit Census Data

1.2 Manually querying IPUMS GeoMarker to get census data

Loading NPO census data results

The output of the census merge has a duplicated data point.

## [1] 263145
## [1] 263145
## [1] 263146
## [1] 263145
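A check of this kind can be sketched in base R as follows. We assume here that the four numbers above are, in order, the rows and distinct keys of the input and the rows and distinct keys of the returned data; the report does not label them. The data frames `npo` and `npo_census` and their values are small hypothetical stand-ins.

```r
# Hypothetical stand-ins: `npo` is the input, `npo_census` the returned data.
npo <- data.frame(key = c(101, 102, 103))
npo_census <- data.frame(
  key    = c(101, 102, 102, 103),
  TRACTA = c(100, 300, 202, 500)
)

nrow(npo)                        # rows sent
length(unique(npo$key))          # distinct keys sent
nrow(npo_census)                 # rows returned
length(unique(npo_census$key))   # distinct keys returned

# Keys that appear more than once in the returned data.
dup_keys <- unique(npo_census$key[duplicated(npo_census$key)])

# All rows for those keys, so the competing tract assignments can be compared.
npo_census[npo_census$key %in% dup_keys, ]
```

When the row count exceeds the distinct-key count, the difference is the number of extra rows that need to be resolved before merging.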

Measures for the duplicated values seem very different:

Table continues below
  key lat lon GISJOIN STATE STATEA
41575 41591 37.12 -120.3 G0600390000300 California 6
41576 41591 37.12 -120.3 G0600390000202 California 6
Table continues below
  COUNTY COUNTYA TRACTA GM001_2017 GM002_2017
41575 Madera County 39 300 0.04142 0.3172
41576 Madera County 39 202 0.1036 0.194
Table continues below
  GM003_2017 GM004_2017 GM005_2017 GM006_2017 GM007_2017
41575 34841 0.432 0.2861 0.417 0.02469
41576 58568 0.4436 0.1307 0.6042 0.0103
  GM008_2017 GM009_2017 GM010_2017
41575 0.6271 887.1 287.5
41576 0.8058 20.76 7.676

1.3 Testing duplicated results

We test different ways to resolve the duplicated case:

  1. Running the duplicated address through IPUMS again

Loading results

Table continues below
key lat lon GISJOIN STATE STATEA COUNTY
41591 37.12 -120.3 G0600390000202 California 6 Madera County
41591 37.12 -120.3 G0600390000300 California 6 Madera County
Table continues below
COUNTYA TRACTA GM001_2017 GM002_2017 GM003_2017 GM004_2017
39 202 0.1036 0.194 58568 0.4436
39 300 0.04142 0.3172 34841 0.432
GM005_2017 GM006_2017 GM007_2017 GM008_2017 GM009_2017 GM010_2017
0.1307 0.6042 0.0103 0.8058 20.76 7.676
0.2861 0.417 0.02469 0.6271 887.1 287.5

We get the same issue: duplicated results.

  2. Running the duplicated cases again through IPUMS, but this time using their address data (not lat/lon)

The query fails to make the match.

  3. Manually getting new lat/lon data from Google Maps and running the case through IPUMS again

## [1] "240 N 1ST STREET, CHOWCHILLA, CA, 93610"

Loading results

Table continues below
key lat lon GISJOIN STATE STATEA COUNTY
1 37.13 -120.3 G0600390000202 California 6 Madera County
1 37.13 -120.3 G0600390000300 California 6 Madera County
Table continues below
COUNTYA TRACTA GM001_2017 GM002_2017 GM003_2017 GM004_2017
39 202 0.1036 0.194 58568 0.4436
39 300 0.04142 0.3172 34841 0.432
GM005_2017 GM006_2017 GM007_2017 GM008_2017 GM009_2017 GM010_2017
0.1307 0.6042 0.0103 0.8058 20.76 7.676
0.2861 0.417 0.02469 0.6271 887.1 287.5

Despite using different lat/lons, we get the same issue: duplicated results.

  4. Manually identifying the census tract of the address

Using this website and Google Maps, we were able to determine that the location is within tract 300.

Updating the IPUMS results to exclude the duplicate case that is not in tract 300.
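As a rough sketch, the exclusion can be written as a row filter in base R. The key 41591 and the two tract codes come from the tables above; the data frame name `npo_census` and the extra row with key 41590 are hypothetical and only there to show that other rows are left untouched.

```r
# Toy version of the GeoMarker output; key 41590 is a made-up unaffected row.
npo_census <- data.frame(
  key    = c(41590, 41591, 41591),
  TRACTA = c(100, 300, 202)
)

# Flag the row for the duplicated key that is NOT in the verified tract.
drop_row <- npo_census$key == 41591 & npo_census$TRACTA != 300

# Keep everything else.
npo_census <- npo_census[!drop_row, ]

# After the fix every key should appear exactly once.
stopifnot(!any(duplicated(npo_census$key)))
npo_census
```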

2. Getting Census Data for Board Members

2.2 Manually querying the census using IPUMS

After manually getting the census data from IPUMS, we load the results

The output of the census merge has duplicates.

## [1] 943445
## [1] 943445
## [1] 943452
## [1] 943445

The IPUMS GeoMarker results place all the repeated cases in Madera County, CA, which is not accurate for all of them.

Table continues below
  key lat lon GISJOIN STATE STATEA
55975 56103 37.12 -120.3 G0600390000202 California 6
55976 56103 37.12 -120.3 G0600390000300 California 6
55979 56106 37.12 -120.3 G0600390000202 California 6
55980 56106 37.12 -120.3 G0600390000300 California 6
154659 155140 27.52 -82.65 G0600390000300 California 6
154660 155140 27.52 -82.65 G0600390000202 California 6
154662 155142 37.34 -121.9 G0600390000202 California 6
154663 155142 37.34 -121.9 G0600390000300 California 6
349310 350685 40.13 -75.01 G0600390000202 California 6
349311 350685 40.13 -75.01 G0600390000300 California 6
349312 350686 40.12 -75.51 G0600390000300 California 6
349313 350686 40.12 -75.51 G0600390000202 California 6
794100 796630 31.33 -94.71 G0600390000202 California 6
794101 796630 31.33 -94.71 G0600390000300 California 6
Table continues below
  COUNTY COUNTYA TRACTA GM001_2017 GM002_2017
55975 Madera County 39 202 0.1036 0.194
55976 Madera County 39 300 0.04142 0.3172
55979 Madera County 39 202 0.1036 0.194
55980 Madera County 39 300 0.04142 0.3172
154659 Madera County 39 300 0.04142 0.3172
154660 Madera County 39 202 0.1036 0.194
154662 Madera County 39 202 0.1036 0.194
154663 Madera County 39 300 0.04142 0.3172
349310 Madera County 39 202 0.1036 0.194
349311 Madera County 39 300 0.04142 0.3172
349312 Madera County 39 300 0.04142 0.3172
349313 Madera County 39 202 0.1036 0.194
794100 Madera County 39 202 0.1036 0.194
794101 Madera County 39 300 0.04142 0.3172
Table continues below
  GM003_2017 GM004_2017 GM005_2017 GM006_2017 GM007_2017
55975 58568 0.4436 0.1307 0.6042 0.0103
55976 34841 0.432 0.2861 0.417 0.02469
55979 58568 0.4436 0.1307 0.6042 0.0103
55980 34841 0.432 0.2861 0.417 0.02469
154659 34841 0.432 0.2861 0.417 0.02469
154660 58568 0.4436 0.1307 0.6042 0.0103
154662 58568 0.4436 0.1307 0.6042 0.0103
154663 34841 0.432 0.2861 0.417 0.02469
349310 58568 0.4436 0.1307 0.6042 0.0103
349311 34841 0.432 0.2861 0.417 0.02469
349312 34841 0.432 0.2861 0.417 0.02469
349313 58568 0.4436 0.1307 0.6042 0.0103
794100 58568 0.4436 0.1307 0.6042 0.0103
794101 34841 0.432 0.2861 0.417 0.02469
  GM008_2017 GM009_2017 GM010_2017
55975 0.8058 20.76 7.676
55976 0.6271 887.1 287.5
55979 0.8058 20.76 7.676
55980 0.6271 887.1 287.5
154659 0.6271 887.1 287.5
154660 0.8058 20.76 7.676
154662 0.8058 20.76 7.676
154663 0.6271 887.1 287.5
349310 0.8058 20.76 7.676
349311 0.6271 887.1 287.5
349312 0.6271 887.1 287.5
349313 0.8058 20.76 7.676
794100 0.8058 20.76 7.676
794101 0.6271 887.1 287.5
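One quick way to see that some of these assignments cannot be right is to compare each point's coordinates with a rough bounding box for the reported state. The sketch below uses the keys and coordinates from the table above (one row per key); the bounding box for California is an approximate, outward-rounded assumption, and the names `dups`, `ca_box`, and `plausible` are hypothetical.

```r
# One row per duplicated key, coordinates as shown in the table above.
dups <- data.frame(
  key   = c(56103, 56106, 155140, 155142, 350685, 350686, 796630),
  lat   = c(37.12, 37.12, 27.52, 37.34, 40.13, 40.12, 31.33),
  lon   = c(-120.3, -120.3, -82.65, -121.9, -75.01, -75.51, -94.71),
  STATE = "California"
)

# Approximate bounding box for California (an assumption, rounded outward).
ca_box <- list(lat = c(32.5, 42.1), lon = c(-124.5, -114.1))

# TRUE when the point falls inside the box for the state it was assigned to.
dups$plausible <- dups$lat >= ca_box$lat[1] & dups$lat <= ca_box$lat[2] &
                  dups$lon >= ca_box$lon[1] & dups$lon <= ca_box$lon[2]

# Rows whose coordinates cannot be in the reported state.
dups[!dups$plausible, c("key", "lat", "lon", "STATE")]
```

A bounding box is only a coarse filter: a point can pass it and still lie in a different county, so rows that pass still need to be checked against the tract boundaries.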

2.3 Testing duplicated results

We test different ways to resolve the duplicated cases:

  1. Running the duplicated address again through IPUMS

Loading results

Table continues below
key lat lon GISJOIN STATE STATEA COUNTY COUNTYA TRACTA
56103 37.12 -120.3 NA NA NA NA NA NA
56106 37.12 -120.3 NA NA NA NA NA NA
155140 27.52 -82.65 NA NA NA NA NA NA
155142 37.34 -121.9 NA NA NA NA NA NA
350685 40.13 -75.01 NA NA NA NA NA NA
350686 40.12 -75.51 NA NA NA NA NA NA
796630 31.33 -94.71 NA NA NA NA NA NA
Table continues below
GM001_2017 GM002_2017 GM003_2017 GM004_2017 GM006_2017 GM005_2017
NA NA NA NA NA NA
NA NA NA NA NA NA
NA NA NA NA NA NA
NA NA NA NA NA NA
NA NA NA NA NA NA
NA NA NA NA NA NA
NA NA NA NA NA NA
GM007_2017 GM008_2017 GM009_2017 GM010_2017
NA NA NA NA
NA NA NA NA
NA NA NA NA
NA NA NA NA
NA NA NA NA
NA NA NA NA
NA NA NA NA

We get NAs for all cases.

  2. Running the duplicated cases again through IPUMS, but this time using their address data (not lat/lon)

The query fails to make the match.

  3. Geocoding the 7 cases using Google to get new lat/lon data and running the cases through IPUMS again

Loading results

Table continues below
key lat lon GISJOIN STATE STATEA
56103 37.12 -120.3 G0600390000300 California 6
56103 37.12 -120.3 G0600390000202 California 6
56106 37.11 -120.3 G0600390000202 California 6
56106 37.11 -120.3 G0600390000300 California 6
155140 27.52 -82.65 G1200810001204 Florida 12
155142 37.34 -121.9 G0600850500200 California 6
350685 40.13 -75.01 G4201010036501 Pennsylvania 42
350686 40.12 -75.51 G4200290300502 Pennsylvania 42
796630 31.33 -94.72 G4800050000800 Texas 48
Table continues below
COUNTY COUNTYA TRACTA GM001_2017 GM002_2017 GM003_2017
Madera County 39 300 0.04142 0.3172 34841
Madera County 39 202 0.1036 0.194 58568
Madera County 39 202 0.1036 0.194 58568
Madera County 39 300 0.04142 0.3172 34841
Manatee County 81 1204 0.03951 0.04004 74955
Santa Clara County 85 500200 0.05171 0.1639 99942
Philadelphia County 101 36501 0.05577 0.1258 45610
Chester County 29 300502 0.03798 0.05477 89353
Angelina County 5 800 0.03821 0.1128 51775
Table continues below
GM004_2017 GM005_2017 GM006_2017 GM007_2017 GM008_2017 GM009_2017
0.432 0.2861 0.417 0.02469 0.6271 887.1
0.4436 0.1307 0.6042 0.0103 0.8058 20.76
0.4436 0.1307 0.6042 0.0103 0.8058 20.76
0.432 0.2861 0.417 0.02469 0.6271 887.1
0.4246 0.07564 0.8201 0.03455 0.9836 449.4
0.3925 0.2143 0.2833 0.05557 0.8229 3727
0.4486 0.1857 0.4692 0.09502 0.9118 2897
0.4693 0.1572 0.8906 0.03036 0.9565 482.8
0.3936 0.2931 0.751 0.1521 0.8824 317.5
GM010_2017
287.5
7.676
7.676
287.5
275.4
1331
1300
189.9
135.6

This resolves the duplication in all cases except two, both in Madera County, CA.
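The keys that are still duplicated after re-geocoding can be listed with a simple count per key. This is a sketch using the keys, counties, and tract codes from the table above; the data frame name `regeo` is hypothetical.

```r
# Re-geocoded results, reduced to the columns needed for the check.
regeo <- data.frame(
  key    = c(56103, 56103, 56106, 56106, 155140, 155142,
             350685, 350686, 796630),
  COUNTY = c("Madera County", "Madera County", "Madera County",
             "Madera County", "Manatee County", "Santa Clara County",
             "Philadelphia County", "Chester County", "Angelina County"),
  TRACTA = c(300, 202, 202, 300, 1204, 500200, 36501, 300502, 800)
)

# Number of returned rows per key.
n_per_key <- table(regeo$key)

# Keys with more than one row are still unresolved.
still_dup <- as.numeric(names(n_per_key)[as.vector(n_per_key) > 1])

# Show the competing rows for the unresolved keys.
regeo[regeo$key %in% still_dup, ]
```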

Using this website and Google Maps, we were able to determine that both locations are within tract 300.

Note: When inputting the address manually through Google Maps, the lat/lons we get differ slightly from those returned by the Google geocoding service. They are very close, though.

key source lat lon
56103 gmaps 37.123258 -120.267232
56103 gglgeo 37.12309 -120.26754
56106 gmaps 37.114607 -120.262027
56106 gglgeo 37.11442 -120.26255
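To put a number on "very close", the gap between the two sources can be computed as a great-circle distance. The sketch below uses the haversine formula with a mean Earth radius of 6,371 km (an assumption; other radii change the result only slightly) and the coordinates from the table above. The function name `haversine_m` and the data frame `coords` are hypothetical.

```r
# Great-circle distance in meters between two lat/lon points (haversine).
haversine_m <- function(lat1, lon1, lat2, lon2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(sqrt(a))
}

# Coordinates from the table above, one row per key.
coords <- data.frame(
  key        = c(56103, 56106),
  lat_gmaps  = c(37.123258, 37.114607),
  lon_gmaps  = c(-120.267232, -120.262027),
  lat_gglgeo = c(37.12309, 37.11442),
  lon_gglgeo = c(-120.26754, -120.26255)
)

# Distance between the two sources for each key.
coords$dist_m <- with(
  coords,
  haversine_m(lat_gmaps, lon_gmaps, lat_gglgeo, lon_gglgeo)
)
coords[, c("key", "dist_m")]
```

For both keys the two sources differ by a few tens of meters.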

Selecting the data

2.4 Merging PPL Census Data to main file

Combining the original results with the new duplicates
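One way to do this in base R is to drop every row of the original results whose key was re-processed and then append the corrected rows. This is a sketch under that assumption; the data frames `census` and `fixed` and their values are hypothetical.

```r
# Original results: key 2 was returned twice (hypothetical toy data).
census <- data.frame(key = c(1, 2, 2, 3), TRACTA = c(10, 202, 300, 30))

# Corrected results for the re-processed keys, one row per key.
fixed <- data.frame(key = 2, TRACTA = 300)

# Remove all old rows for the corrected keys, then append the new rows.
census_clean <- rbind(census[!census$key %in% fixed$key, ], fixed)

# Restore key order and tidy the row names.
census_clean <- census_clean[order(census_clean$key), ]
rownames(census_clean) <- NULL

# Each key should now appear exactly once.
stopifnot(!any(duplicated(census_clean$key)))
census_clean
```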

Preparing the results for the merge

##  [1] "key"        "lat"        "lon"        "GISJOIN"    "STATE"     
##  [6] "STATEA"     "COUNTY"     "COUNTYA"    "TRACTA"     "GM001_2017"
## [11] "GM002_2017" "GM003_2017" "GM004_2017" "GM005_2017" "GM006_2017"
## [16] "GM007_2017" "GM008_2017" "GM009_2017" "GM010_2017"
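The merged file shown further down uses descriptive names such as unemp and poverty in place of the GM codes. The sketch below renames the columns with a lookup vector. The mapping is an assumption inferred only from the column order in the two name listings, not from a codebook, and `rename_map` is a hypothetical name. The example row uses the tract 300 values for key 56103 from the tables above.

```r
# Assumed old-name -> new-name mapping, inferred from column order only.
rename_map <- c(
  STATEA = "STATEFIPS", COUNTYA = "COUNTYFIPS", TRACTA = "TRACTFIPS",
  GM001_2017 = "unemp",    GM002_2017 = "poverty",
  GM003_2017 = "medinc",   GM004_2017 = "inequality",
  GM005_2017 = "single",   GM006_2017 = "ownerocc",
  GM007_2017 = "black",    GM008_2017 = "hs",
  GM009_2017 = "p.density", GM010_2017 = "h.density"
)

# One example row with the original column names.
census <- data.frame(
  key = 56103, lat = 37.12, lon = -120.3,
  GISJOIN = "G0600390000300", STATE = "California", STATEA = 6,
  COUNTY = "Madera County", COUNTYA = 39, TRACTA = 300,
  GM001_2017 = 0.04142, GM002_2017 = 0.3172, GM003_2017 = 34841,
  GM004_2017 = 0.432, GM005_2017 = 0.2861, GM006_2017 = 0.417,
  GM007_2017 = 0.02469, GM008_2017 = 0.6271, GM009_2017 = 887.1,
  GM010_2017 = 287.5
)

# Rename only the columns that appear in the mapping; leave the rest as-is.
old <- names(census)
hit <- old %in% names(rename_map)
names(census)[hit] <- unname(rename_map[old[hit]])
names(census)
```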

Merging
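A merge of this kind can be sketched with base R's `merge()`, keeping every row of the main file and attaching census columns by `key`. The toy data frames `ppl` and `census` are hypothetical, and the original script may well have used a different join function.

```r
# Hypothetical main file and de-duplicated census results.
ppl <- data.frame(ID = 1:3, key = c(10, 20, 30),
                  Lastname = c("A", "B", "C"))
census <- data.frame(key = c(10, 20), medinc = c(50000, 70000))

# Left join: all.x = TRUE keeps people without a census match (NA values).
merged <- merge(ppl, census, by = "key", all.x = TRUE)

# With one census row per key, the merge must not change the row count.
stopifnot(nrow(merged) == nrow(ppl))
merged
```

Note that `merge()` sorts the result by the join column and moves that column to the front, so column and row order may differ from the input.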

Finally, we add the Google geocode information we obtained for the 7 duplicated cases and update their geocode_type from census to google.

## [1] "key"           "input_address" "lon"           "lat"          
## [5] "address"
##  [1] "ID"              "key"             "ORGNAME"         "EIN"            
##  [5] "YR"              "Signaturedate"   "Case.Number"     "Firstname"      
##  [9] "Lastname"        "Title"           "Address"         "City"           
## [13] "State"           "Zip"             "Zippl4"          "gender"         
## [17] "proportion_male" "IDdup"           "pob"             "add.len"        
## [21] "input_address"   "match"           "match_type"      "out_address"    
## [25] "lat_lon_cen"     "tiger_line_id"   "tiger_line_side" "state_fips"     
## [29] "county_fips"     "tract_fips"      "block_fips"      "lon_cen"        
## [33] "lat_cen"         "geocode_type"    "lon_ggl"         "lat_ggl"        
## [37] "address_ggl"     "lat_zip1"        "lon_zip1"        "City_zip2"      
## [41] "State_zip2"      "lat_zip2"        "lon_zip2"        "city_st"        
## [45] "lat_cty"         "lon_cty"         "lat"             "lon"            
## [49] "GISJOIN"         "STATE"           "STATEFIPS"       "COUNTY"         
## [53] "COUNTYFIPS"      "TRACTFIPS"       "unemp"           "poverty"        
## [57] "medinc"          "inequality"      "single"          "ownerocc"       
## [61] "black"           "hs"              "p.density"       "h.density"
##  [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE  TRUE
key lon_ggl lat_ggl address_ggl
56103 NA NA NA
56103 NA NA NA
56106 NA NA NA
56106 NA NA NA
155140 NA NA NA
155140 NA NA NA
155142 NA NA NA
155142 NA NA NA
350685 NA NA NA
350685 NA NA NA
350686 NA NA NA
350686 NA NA NA
796630 NA NA NA
796630 NA NA NA
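A minimal sketch of such an update, assuming the Google results are held in a data frame with the columns key, lon, lat, and address listed above: look up each row of the main file by key, copy the Google values into lon_ggl, lat_ggl, and address_ggl, and switch geocode_type. The data frames `ppl` and `ggl` are hypothetical stand-ins, key 56104 is a made-up unaffected row, and the address strings are placeholders.

```r
# Hypothetical slice of the main file; key 56104 is a made-up unaffected row.
ppl <- data.frame(
  key          = c(56103, 56104, 155140),
  geocode_type = "census",
  lon_ggl      = NA_real_,
  lat_ggl      = NA_real_,
  address_ggl  = NA_character_,
  stringsAsFactors = FALSE
)

# Hypothetical Google geocode results (addresses are placeholders).
ggl <- data.frame(
  key     = c(56103, 155140),
  lon     = c(-120.26754, -82.65),
  lat     = c(37.12309, 27.52),
  address = c("(placeholder address 1)", "(placeholder address 2)"),
  stringsAsFactors = FALSE
)

# Position of each main-file key in the Google results (NA if absent).
idx <- match(ppl$key, ggl$key)
hit <- !is.na(idx)

# Copy the Google values and relabel only the matched rows.
ppl$lon_ggl[hit]      <- ggl$lon[idx[hit]]
ppl$lat_ggl[hit]      <- ggl$lat[idx[hit]]
ppl$address_ggl[hit]  <- ggl$address[idx[hit]]
ppl$geocode_type[hit] <- "google"
ppl
```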

Saving new main dataset
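Saving can be sketched with `saveRDS()`. The file name matches the output file listed in the introduction; writing to a temporary directory and the toy data frame `ppl` are assumptions made so the example runs anywhere.

```r
# Hypothetical final dataset.
ppl <- data.frame(key = c(1, 2), medinc = c(50000, 70000))

# Write to a temporary directory here; the real script would use its own path.
out_path <- file.path(tempdir(), "PEOPLE-2014-2019v6.rds")
saveRDS(ppl, out_path)

# Read the file back to confirm it round-trips unchanged.
check <- readRDS(out_path)
stopifnot(identical(check, ppl))
```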