Introduction

PENDING

  • Show analysis of PO box length.
  • Geocode PO boxes with addresses.

In this script we geocode PO boxes (POBs) and addresses that failed earlier geocoding, using zip code and city centroids. The resulting dataset is saved as a new version of the main files:

  • NONPROFITS-2014-2019v5.rds
  • PEOPLE-2014-2019v5.rds

STEPS

  1. Geocoding with zip code centroids: we will add zip code centroids to the NPO and PPL files.
  2. Geocoding with city centroids: city centroids will be added.
  3. Generating a final lat/lon variable.
  4. Exploring the results.

PACKAGES

1. Adding Zipcode Centroids

1.2 NPO dataset

Adding zip centroids to the following files:

  • NONPROFITS-2014-2019v4.rds
  • PEOPLE-2014-2019v4.rds

The dataset with new geocodes will be saved as:

  • NONPROFITS-2014-2019v5.rds
  • PEOPLE-2014-2019v5.rds

Loading NPO main file

Adding Zip1 and Zip2 geocode data and adding a geocode_type value

census  google   zip1  zip2   NA
182042   37794  33638  9557  241

Only 241 addresses have no geocode.
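The zip-centroid step can be pictured as a lookup with a fallback. The sketch below is a minimal illustration in Python, not the script's actual code (the .rds files suggest the original is written in R). The field names `zip`, `lat_zip1`, `lon_zip1`, `lat_zip2`, `lon_zip2`, the two lookup tables, and the helper `add_zip_centroids` are all assumptions made for the example, and the coordinates are invented.

```python
from collections import Counter


def add_zip_centroids(records, zip1, zip2):
    """Attach Zip1 and Zip2 centroids and label records that had no geocode.

    records: list of dicts with keys "zip" and "geocode_type" (hypothetical names).
    zip1, zip2: dicts mapping a zip code string to a (lat, lon) tuple.
    Zip1 is checked first, so Zip2 only labels records that Zip1 could not match.
    """
    out = []
    for rec in records:
        rec = dict(rec)  # copy so the caller's data is not mutated
        for label, lookup in (("zip1", zip1), ("zip2", zip2)):
            lat, lon = lookup.get(rec.get("zip"), (None, None))
            rec[f"lat_{label}"], rec[f"lon_{label}"] = lat, lon
            # Only label records that still have no geocode source.
            if rec.get("geocode_type") is None and lat is not None:
                rec["geocode_type"] = label
        out.append(rec)
    return out


# Toy data: values are invented for illustration only.
zip1 = {"13210": (43.03, -76.13)}
zip2 = {"13210": (43.04, -76.12), "99501": (61.22, -149.86)}
records = [
    {"id": 1, "zip": "13210", "geocode_type": "census"},
    {"id": 2, "zip": "13210", "geocode_type": None},
    {"id": 3, "zip": "99501", "geocode_type": None},
    {"id": 4, "zip": "00000", "geocode_type": None},
]
coded = add_zip_centroids(records, zip1, zip2)
type_counts = Counter(r["geocode_type"] for r in coded)
print(type_counts)
```

Tabulating `geocode_type` afterwards gives the same kind of summary as the table above: one count per source, plus a count of records that remain without a geocode.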

Saving file

2. Adding City centroids to the remaining addresses

Getting City Centroids from: https://public.opendatasoft.com/explore/dataset/1000-largest-us-cities-by-population-with-geographic-coordinates/table/?sort=-rank

Note: city names repeat across the US, so we must match on city and state together to be certain we pick the right city.

Loading the city file

Cities repeat unless we use the city+state:

  • 75 duplicated cities.
  • 0 duplicated city_st
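The duplicate check above can be sketched as follows. This is a hedged Python illustration, assuming a "duplicate" means every repeat after the first occurrence of a key; the helper `count_duplicated`, the `"CITY, ST"` key format, and the sample rows are assumptions, not taken from the city file.

```python
from collections import Counter


def count_duplicated(keys):
    """Count entries that repeat an earlier key (repeats after the first occurrence)."""
    counts = Counter(keys)
    return sum(n - 1 for n in counts.values())


# Toy rows, invented for illustration.
cities = [
    {"city": "Springfield", "state": "IL"},
    {"city": "Springfield", "state": "MO"},
    {"city": "Portland", "state": "OR"},
    {"city": "Portland", "state": "ME"},
    {"city": "Austin", "state": "TX"},
]

# Build a combined key so that same-named cities in different states stay distinct.
for row in cities:
    row["city_st"] = f'{row["city"].strip().upper()}, {row["state"].strip().upper()}'

dup_city = count_duplicated(row["city"] for row in cities)
dup_city_st = count_duplicated(row["city_st"] for row in cities)
print(dup_city, dup_city_st)
```

A key with zero duplicates is safe to use for a one-to-one lookup; a key with duplicates would attach the wrong centroid to some records or multiply rows in a merge.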

Preparing the file for merge, removing unwanted vars

2.2 PPL

Creating a city_st variable in the PPL dataset

Now merging with the main PPL data

Adding a geocode_type value

census  city  google   zip1   zip2    NA
689983   617  149739  87597  15509  2648

Only 2648 addresses have no geocode.
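The city step (build a `city_st` key, merge the centroids, then label the records that were still missing a geocode) can be sketched as below. This is an illustrative Python sketch under stated assumptions: the field names `city`, `state`, `lat_city`, `lon_city`, the helpers `make_city_st` and `add_city_centroids`, and the approximate coordinates are all hypothetical.

```python
def make_city_st(city, state):
    """Build an upper-case "CITY, ST" key; return None if either part is missing."""
    if not city or not state:
        return None
    return f"{city.strip().upper()}, {state.strip().upper()}"


def add_city_centroids(people, centroids):
    """Left-join city centroids onto records using the city_st key.

    Every record is kept. Records that still have no geocode_type and
    match a city in the lookup are labeled "city".
    """
    out = []
    for rec in people:
        rec = dict(rec)  # copy so the caller's data is not mutated
        rec["city_st"] = make_city_st(rec.get("city"), rec.get("state"))
        lat, lon = centroids.get(rec["city_st"], (None, None))
        rec["lat_city"], rec["lon_city"] = lat, lon
        if rec.get("geocode_type") is None and lat is not None:
            rec["geocode_type"] = "city"
        out.append(rec)
    return out


# Toy lookup and records, invented for illustration.
centroids = {
    "SPRINGFIELD, IL": (39.80, -89.64),
    "SPRINGFIELD, MO": (37.21, -93.29),
}
people = [
    {"id": 1, "city": "springfield ", "state": "mo", "geocode_type": None},
    {"id": 2, "city": "Springfield", "state": "IL", "geocode_type": "zip1"},
    {"id": 3, "city": "Smallville", "state": "KS", "geocode_type": None},
]
merged = add_city_centroids(people, centroids)
print([r["geocode_type"] for r in merged])
```

Because the city file covers only the largest cities, records in smaller places find no match and stay without a geocode, which is consistent with a residual NA count after this step.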

Saving file

3. Generating a Final lat and lon value

NOTE: this prioritization list should be backed by a comparison. Why is Google better? Why is Zip1 better than Zip2? How large is the difference between Google and Census? Etc.

Geocode information comes from different sources, and geocodes may be updated as the database evolves. When more than one source is available for an address, the source we use follows the geocode prioritization list below:

  1. Google
  2. Census
  3. Zip (Zip1 > Zip2)
  4. City

Note: for more detail on this priority list see the Research Note.

Following this list, we will generate Lat and Lon variables in each data set.
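A priority list like this amounts to taking the first available value in priority order. The Python sketch below illustrates that idea only; the per-source column names (`lat_google`, `lon_census`, and so on) and the helper `final_coordinates` are assumptions, and the sample coordinates are invented.

```python
# Highest priority first, following the list above.
PRIORITY = ("google", "census", "zip1", "zip2", "city")


def final_coordinates(rec):
    """Return (lat, lon, source) from the highest-priority source that has both values."""
    for source in PRIORITY:
        lat = rec.get(f"lat_{source}")
        lon = rec.get(f"lon_{source}")
        if lat is not None and lon is not None:
            return lat, lon, source
    return None, None, None  # no source has coordinates


# Toy record: no Google geocode, but Census and Zip1 are present.
row = {
    "lat_google": None, "lon_google": None,
    "lat_census": 43.05, "lon_census": -76.15,
    "lat_zip1": 43.03, "lon_zip1": -76.13,
}
lat, lon, source = final_coordinates(row)
print(lat, lon, source)
```

Picking the first available source from the top of the list gives the same result as writing the lowest-priority values first and then overwriting them with each higher-priority source in turn.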

3.1 NPOs

Loading file

Creating the new variables

Now adding the prioritized lat/lon data by overwriting the values in the priority order.

Checking that the information is consistent

## .
##  FALSE   TRUE 
## 263145    127
## 
##  FALSE   TRUE 
## 263145    127
## 
## census   city google   zip1   zip2   <NA> 
## 182042    114  37794  33638   9557    127
## x
##  TRUE 
## 37794
## x
##   TRUE 
## 182042
## x
##  TRUE 
## 33638
## x
## TRUE 
## 9557
## x
## TRUE 
##  114
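One way to express checks of this kind is sketched below in Python. It is an assumption about what "consistent" means here, not a reconstruction of the code behind the output above: a record with no geocode type should have no final coordinates, and a record with a geocode type should carry exactly that source's coordinates. The field names and the helper `check_consistency` are hypothetical.

```python
def check_consistency(records):
    """Return the ids of records whose final lat/lon disagree with their geocode_type."""
    problems = []
    for rec in records:
        src = rec.get("geocode_type")
        lat, lon = rec.get("lat"), rec.get("lon")
        if src is None:
            # No source: both final coordinates must be missing.
            ok = lat is None and lon is None
        else:
            # A source is named: final values must exist and equal that source's values.
            ok = (
                lat is not None
                and lon is not None
                and lat == rec.get(f"lat_{src}")
                and lon == rec.get(f"lon_{src}")
            )
        if not ok:
            problems.append(rec.get("id"))
    return problems


# Toy records, invented for illustration; id 3 is deliberately inconsistent.
records = [
    {"id": 1, "geocode_type": "google", "lat": 43.05, "lon": -76.15,
     "lat_google": 43.05, "lon_google": -76.15},
    {"id": 2, "geocode_type": None, "lat": None, "lon": None},
    {"id": 3, "geocode_type": "zip1", "lat": 43.05, "lon": -76.15,
     "lat_zip1": 43.03, "lon_zip1": -76.13},
]
bad_ids = check_consistency(records)
print(bad_ids)
```

An empty result would correspond to tables like the ones above, where the number of missing coordinates equals the number of records without a geocode type and every per-source comparison is TRUE.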

Saving the file

3.2 PPL

Loading file

Creating the new variables

Now adding the prioritized lat/lon data by overwriting the values in the priority order.

Checking that the information is consistent

## .
##  FALSE   TRUE 
## 943445   2648
## 
##  FALSE   TRUE 
## 943445   2648
## 
## census   city google   zip1   zip2   <NA> 
## 689983    617 149739  87597  15509   2648
## x
##   TRUE 
## 149739
## x
##   TRUE 
## 689983
## x
##  TRUE 
## 87597
## x
##  TRUE 
## 15509
## x
## TRUE 
##  617

Saving the file

4. Exploring data

4.1 Summary of geocoding types

For the NPO dataset

         frequency  percent
google       37794  14.36 %
census      182042  69.15 %
zip1         33638  12.78 %
zip2          9557   3.63 %
city           114   0.04 %
NA             127   0.05 %
TOTAL       263272    100 %

  • Google and Census geocodes add up to 83.50%.
  • Zip and city centroids add up to 16.45%.

For the PPL dataset

         frequency  percent
google      149739  15.83 %
census      689983  72.93 %
zip1         87597   9.26 %
zip2         15509   1.64 %
city           617   0.07 %
NA            2648   0.28 %
TOTAL       946093    100 %

  • Google and Census geocodes add up to 88.76%.
  • Zip and city centroids add up to 10.96%.
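The percentages in these summaries follow directly from the frequency counts. As a worked check, the Python sketch below recomputes the NPO figures from the counts reported above; the helper `frequency_table` and the variable names are assumptions introduced for the example.

```python
def frequency_table(counts):
    """Return ({label: (count, percent)}, total), with percent rounded to 2 decimals."""
    total = sum(counts.values())
    table = {label: (n, round(100 * n / total, 2)) for label, n in counts.items()}
    return table, total


# NPO counts as reported in the summary table.
npo_counts = {
    "google": 37794,
    "census": 182042,
    "zip1": 33638,
    "zip2": 9557,
    "city": 114,
    "NA": 127,
}
table, total = frequency_table(npo_counts)

# Address-level geocodes (Google + Census) versus centroid-level geocodes (zips + city).
address_level = 100 * (npo_counts["google"] + npo_counts["census"]) / total
centroid_level = 100 * (npo_counts["zip1"] + npo_counts["zip2"] + npo_counts["city"]) / total

for label, (n, pct) in table.items():
    print(f"{label:<8}{n:>8}{pct:>8.2f} %")
print(f"address-level: {address_level:.2f} %, centroid-level: {centroid_level:.2f} %")
```

The two shares do not sum to 100% because the records without any geocode are in neither group.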

Zips2 matches more addresses than Zips1 (counts are out of the 263272 NPO records).

zips  matches  percent
zip1   251466   95.5 %
zip2   259894   98.7 %

Which cases did not match in Zips1?

Which cases did not match in zips2?

Are the lat/lon values the same between Zips1 and Zips2?

 FALSE  TRUE     NA
248827    46  14399
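A comparison like this has three possible outcomes per record: equal, different, or not comparable because one source is missing. The Python sketch below illustrates that three-valued logic and adds an optional tolerance, since exact equality of decimal coordinates from two different files can be very strict. The helper `same_point`, the tolerance value, and the sample points are assumptions; whether the table above used exact equality is not stated in the text.

```python
import math
from collections import Counter


def same_point(a, b, tol=0.0):
    """Compare two (lat, lon) pairs.

    Returns None when either point is missing (like NA), otherwise True or False.
    tol is an absolute tolerance in degrees; tol=0.0 means exact equality.
    """
    if a is None or b is None or None in a or None in b:
        return None
    return all(math.isclose(x, y, rel_tol=0.0, abs_tol=tol) for x, y in zip(a, b))


# Toy (zip1, zip2) centroid pairs, invented for illustration.
pairs = [
    ((43.03, -76.13), (43.03, -76.13)),    # identical
    ((43.03, -76.13), (43.034, -76.128)),  # close but not identical
    ((43.03, -76.13), None),               # zip2 did not match
]
exact = Counter(same_point(a, b) for a, b in pairs)
loose = Counter(same_point(a, b, tol=0.01) for a, b in pairs)
print(exact)
print(loose)
```

Comparing the exact and tolerant tallies would show how many of the FALSE cases are small numerical differences rather than genuinely different centroids.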

Is the city the same between the original data and the city derived from the zip file (Zips2)?

         FALSE    TRUE    NA
count    23427  236455  3390
percent    8.9    89.8   1.3

Is the state the same between the original data and the state derived from the zip file (Zips2)?

         FALSE    TRUE    NA
count     1504  258336  3432
percent    0.6    98.1   1.3
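City and state comparisons like the two above are sensitive to spelling details such as case, punctuation, and extra spaces. The Python sketch below shows one hedged way to compare place names after light normalization and to tabulate the result as percentages. The helpers `normalize_place`, `same_place`, and `percent_table` are hypothetical, and the text does not say whether the original comparison normalized the names, so some of the reported mismatches may or may not be formatting differences.

```python
import re
from collections import Counter


def normalize_place(name):
    """Upper-case, drop punctuation, and collapse whitespace; None stays None."""
    if name is None:
        return None
    cleaned = re.sub(r"[^A-Z0-9 ]", "", name.upper())
    cleaned = re.sub(r"\s+", " ", cleaned).strip()
    return cleaned or None


def same_place(a, b):
    """Three-valued comparison: None if either name is missing, else True or False."""
    a, b = normalize_place(a), normalize_place(b)
    if a is None or b is None:
        return None
    return a == b


def percent_table(values):
    """Share of each outcome, in percent, rounded to one decimal."""
    counts = Counter(values)
    total = sum(counts.values())
    return {key: round(100 * n / total, 1) for key, n in counts.items()}


# Toy (original city, zip-file city) pairs, invented for illustration.
pairs = [
    ("St. Louis", "ST LOUIS"),
    ("Syracuse", "syracuse "),
    ("New York", "Brooklyn"),
    ("Boston", None),
]
shares = percent_table(same_place(a, b) for a, b in pairs)
print(shares)
```

A mismatch that survives normalization, such as a borough name against a city name, points to a real difference between the mailing address and the zip file rather than a spelling variant.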