In this script we will work with the NONPROFIT-2014-2019.rds and PEOPLE-2014-2019.rds files to:
New datasets will be stored as NONPROFIT-2014-2019v2.rds and PEOPLE-2014-2019v2.rds. In addition address files NPOAddresses_census.rds and PPLAddresses_census.rds will be created to pass them through the Census Geocoding service.
library( dplyr )
library(tidyverse)
library( tidyr )
library( pander )
library( httr )
library( stringr )
#update the path with your working directory:
wd <- "/Users/icps86/Dropbox/R Projects/Open_data_ignacio"
setwd(wd)
dir()
## [1] "_Archive" "Data"
## [3] "Data2" "Documentation"
## [5] "GeoProcessReport.html" "GeoProcessReport.Rmd"
## [7] "Icon\r" "Images"
## [9] "Research Note" "Step-01-ProcessRawData.html"
## [11] "Step-01-ProcessRawData.Rmd" "Step-02-POBsandGaps.html"
## [13] "Step-02-POBsandGaps.Rmd" "Step-03-CensusGeo.Rmd"
## [15] "Step-03-fix.Rmd" "Step-03.1-CensusGeo.html"
## [17] "Step-03.2-CompilingResults.Rmd" "Step-03.3-SecondCensusGeo.Rmd"
## [19] "Step-04-GoogleGeo.Rmd" "Step-05-ZipsandCityGeo.Rmd"
## [21] "Step-06-Disambiguation.Rmd" "StepsTable.fld"
## [23] "StepsTable.html" "zip-codes"
We will be looking for POB (Post Office Boxes). Might also find APOs (Army Post Office), FPOs (Fleet Post Office) and DPOs (Diplomatic Post Office).
Loading file and creating a POB variable to flag addresses that are pobs.
#loading data
npo <- readRDS("Data/2_InputData/NONPROFITS-2014-2019.rds")
#creating a variable for pob flags
npo$pob <- NA
Identifying the pobs using a regular expression code
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", npo$Address,ignore.case=TRUE)
pob <- npo$Address[which(x)]
head(pob, 5)
## [1] "P O BOX 284" "PO BOX 1111" "PO BOX 261" "PO BOX 98" "PO BOX 196"
Addressess identified as POBs are sometimes very lengthy…
#including a character length var
pob.len <- nchar(pob)
pobs <- tibble(pob,pob.len)
#arranging it
pobs <- arrange(pobs, desc(pob.len))
head(pobs)
7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|
155 | 452 | 2197 | 10804 | 8907 | 4277 | 2838 | 446 | 60 |
Most NPO POB addresses are clusterd around the 11 character length.
After taking a closer look, we notice that lengthy POB addresses have a mix of both POB info and a regular address.
Take a look at POBs with 25 characters
A POB flag and address character length var will be added to the actual dataset
#char length
npo$add.len <- nchar(npo$Address)
#pob flag
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", npo$Address,ignore.case=TRUE)
npo$pob <- as.numeric(x) #adding a flag for pobs identified
How many pobs?
x <- as.data.frame(prop.table(table(npo$pob)))
names(x) <- c("pob","percent")
x$percent <- paste0(round(x$percent*100,1)," %")
pander(x)
pob | percent |
---|---|
0 | 87.6 % |
1 | 12.4 % |
We will merge the ADDRESS, CITY, STATE and ZIP variable to create an INPUT ADDRESS var in the NPO dataset for geocoding.
#creating an Input Address variable
npo$input_address <- paste(npo$Address, npo$City, npo$State, npo$Zip, sep = ", ")
Saving the npo file
Loading file and creating a pob variable to flag addresses that are pobs.
#loading data
ppl <- readRDS("Data/2_InputData/PEOPLE-2014-2019.rds")
#creating a variable for pob flags
ppl$pob <- NA
Identifying the pobs using a regular expression code
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", ppl$Address,ignore.case=TRUE)
pob <- ppl$Address[which(x)]
head(pob, 5)
## [1] "P O BOX 1742" "RR2 BOX 6464" "PO BOX 7" "PO BOX 7" "PO BOX 7"
Similar to the NPO addresses, some POBs are mixedwith regular address info:
7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
---|---|---|---|---|---|---|---|---|
422 | 834 | 3801 | 19002 | 14801 | 6833 | 5138 | 826 | 160 |
Most ppl POB addresses are also clusterd around the 11 character length.
Adding POB flag and add.len var
ppl$add.len <- nchar(ppl$Address)
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", ppl$Address,ignore.case=TRUE)
ppl$pob <- as.numeric(x) #adding a flag for pobs identified
How many pobs?
x <- as.data.frame(prop.table(table(ppl$pob)))
names(x) <- c("pob","percent")
x$percent <- paste0(round(x$percent*100,1)," %")
pander(x)
pob | percent |
---|---|
0 | 94.1 % |
1 | 5.9 % |
We will merge the ADDRESS, CITY, STATE and ZIP variables to create an INPUT ADDRESS var in the ppl dataset for geocoding.
#creating an Input Address variable
ppl$input_address <- paste(ppl$Address, ppl$City, ppl$State, ppl$Zip, sep = ", ") #creating an input_address field to match the geocode dataframes
Saving rds
In this section we will look into the Addresses and try to identify any problem with the data. We will also create an address_input variable which will be used for geocoding.
The variables that contain address informaiton are:
Subsetting a smaller file to explore addresses, removing pobs
nmz <- c("ID", "ORGNAME", "Address", "City", "State", "Zip", "pob", "add.len")
add <- npo[,nmz]
#dropping all pobs
x <- add$pob == 0
add <- add[x,]
Using the Address character length variable (add.len) to see the distribution and determine when an address is short enough to assume it is unintelligible.
##
## 13 14 15 16 17 18 19 20 21
## 9176 13270 17502 21068 21944 21499 19681 16861 14434
NPO Addresses are clustered around 18 characters
Addresses with length 1-2 are unintelligible and will probably have to be geocoded using zip codes or city centers
## [1] "68" "38" "26" "4" "68" "E" "55" "PO" "32" "81" "N" "13" "79" "91" "24"
Looking at those with length 3-4, they still look unintelligible
## [1] "573" "332" "7472" "1377" "7232" "564" "659" "463" "567" "339"
## [11] "734" "782" "7235" "3815" "6003"
Addresses with length 5-6 also seem to be numbers with not much sense
## [1] "20832" "66612" "720038" "330322" "12492" "11855" "530591" "490141"
## [9] "950691" "26957" "65482" "11065" "42368" "342341" "65745"
However, looking at those with length 7-8, we start to recognize proper addressess.
## [1] "C/O 356" "P O 4911" "HWY 37 N" "919 BERT" "218 LAWN" "276 KING"
## [7] "45 TESLA" "45 TESLA" "408 J ST" "301 UCB" "SUITE 8" "3 DEODAR"
## [13] "832 30TH" "49 RR 2" "497 MAIN"
Looking for NAs within the other relevant columns
City var has some unintelligible values
x <- as.data.frame(table(add$City, useNA = "always"))
names(x)[1] <- "City"
x <- arrange(x, City)
x$City[1:50]
## [1] , ANTIOCH , MONTCLAIR :AGRANGE
## [4] :LITHONIA ` 07013
## [7] 08088-000 0WASSO 1. PACOIMA
## [10] 11231 13035 OLIVE BLVD 1474 BERGER DR
## [13] 20748 2212 SW 55TH 33162
## [16] 37 WOODS DRIVE 3A 45434
## [19] 46TH COURT 48723 50309
## [22] 50674 54911 63103
## [25] 63136 76052 77004
## [28] 818 FARMERS UNION RD 93420 A
## [31] AARONSBURG ABBEVILLE ABBOTTSTOWN
## [34] ABEDEEEN ABERDEEN ABERNATHY
## [37] ABIENE ABILENE ABINGDON
## [40] ABINGTON ABIQUIU ABITA SPRINGS
## [43] ABRAMS ABSAROKEE ABSECON
## [46] ACAMPO ACCIDENT ACCOKEEK
## [49] ACCORD ACKERLY
## 14515 Levels: , ANTIOCH , MONTCLAIR :AGRANGE :LITHONIA ` 07013 ... ZWOLLE
And some values that seem to be too short or too long
## [1] "`" "A" "B" "C" "D" "E" "F" "G" "H" "I" "M" "N" "O" "Q" "R"
## [16] "S" "W" "3A" "AR" "AS"
## [1] "SOUTHEAST WASHINGTON" "SPOTSYLVANIA COURTHO" "SUMMERVILLE / LINCOL"
## [4] "THOUSAND ISLAND PARK" "TRUTH OR CONSEQUENCE" "UNION MILLSUNION MIL"
## [7] "VIRGINVIRGINIA BEACH" "WARRENSVILLE HEIGHTS" "WASHINGTON COURT HOU"
## [10] "WASHINGTON COURTHOUS" "WESLEY HILLS, NY 109" "WEST CHESTER TOWNSHI"
## [13] "WHITE RIVER JUNCTION" "WHITE SULPHUR SPRING" NA
State variable seems more standard
## [1] "AA" "AE" "AK" "AL" "AP" "AR" "AS" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "FM"
## [16] "GA" "GU" "HI" "IA" "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MH" "MI"
## [31] "MN" "MO" "MP" "MS" "MT" "NC" "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK"
## [46] "OR" "PA" "PR" "PW" "RI" "SC" "SD" "TN" "TX" "UM" "UT" "VA" "VI" "VT" "WA"
## [61] "WI" "WV" "WY"
Subsetting a smaller file, and removing pobs
nmz <- c("ID", "ORGNAME", "Address", "City", "State", "Zip", "pob", "add.len")
add <- ppl[,nmz]
#dropping all pobs
x <- add$pob == 0
add <- add[x,]
Using the Address character length variable (add.len) to see the distribution
##
## 13 14 15 16 17 18 19 20 21 22
## 42360 59642 76075 88122 90147 86933 78423 65886 53984 42045
PPL Addresses are clustered around 18 characters.
Similar to NPO data, shorter addresses are unintelligeble.
## [1] "-" "E" "24" "24" "NA" "NA" "NA" "7" "2" "NA" "NA" "5" "NA" "-" "-"
## [1] "NONE" "2832" "230" "NONE" "RT 2" "VOID" "1655" "RR 4" "171" "564"
## [11] "564" "564" "564" "564" "NONE"
## [1] "20832" "20832" "20832" "20832" "14421" "DUEBER" "12950" "11022"
## [9] "44744" "ARROYO" "ARROYO" "530591" "530591" "530591" "530591"
And, again, Addresses with length 7-8 are recognizable as proper addressess.
## [1] "SERETARY" "59 WATER" "107 N ST" "3 JORDAN" "SOUTH ST" "GRAND ST"
## [7] "POLK AVE" "510 HYDE" "512 PARK" "108 POLO" "HEATHER" "RR3 H-21"
## [13] "RR3 H-21" "8 ALDEN" "16CR329" "ELM PARK" "3726 RFD" "5500 IDA"
## [19] "372ND ST" "POBX 387"
Looking for NAs within the other relevant columns
City var has some unintelligible values
x <- as.data.frame(table(add$City, useNA = "always"))
names(x)[1] <- "City"
x <- arrange(x, City)
x$City[1:50]
## [1] - -- , LAKELAND
## [4] , LEWISVILLE , LINCOLNWOOD , SAN JOSE
## [7] , SOUTH OZONE PARK ,ISSOURI CITY ,MARLTON
## [10] ,MARSHALL ,MOKENA ;MARION
## [13] :BARRE :LAS VEGAS :LEWISVILLE
## [16] :LOS ANGELES . BOULDER . FULLERTON
## [19] . PHOENIX .CLARENCE 'WESTFIELD
## [22] / LEXINGTON, /COLORADO SPRINGS /JONESBORO
## [25] /SOUTHFIELD /WADMALAW ISLAND 0
## [28] 0750 OSLO NORWAY 0RLANDO 0XFORD
## [31] 0XNARD 1 NORMAL AVE 1000 VILLA ST
## [34] 102ND ST 10301 10454
## [37] 106 SE 7TH AVENUE 1071 SOUTHERN ARTERY 1101 W VICKSBURG PL
## [40] 11019 SLATER AVE 1107 FAIR OAKS AVE N 1109 WALNUT CIRCLE
## [43] 1111111 1112 WESTMINSTER AVE 11201
## [46] 11370 SE 33RD AVE 1155 STEVEN DR 116 31 STOCKHOLM SWE
## [49] 11701 NORTH WEST 12T 1205
## 26935 Levels: - -- , LAKELAND , LEWISVILLE , LINCOLNWOOD ... ZWOLLE
And some values that seem to be too short or too long
## [1] "-" "0" "2" "A" "b" "B" "C" "D" "E" "F" "G" "H" "J" "K" "L" "M" "N" "O" "P"
## [20] "Q"
## [1] "WASHINGTON COURTHOUS" "WEST CHESTER, PA 193" "WEST PALM BEACH, FLO"
## [4] "WEST PALWEST M BEACH" "WESTMORELAND JAMAICA" "WHITE RIVER JUNCTION"
## [7] "WHITE SULFUR SPRINGS" "WHITE SULPHUR SPRING" "WILLOUGHBY HILLS APT"
## [10] "WILMINGTON, DE 19801" "WOKINGHAM RG402DH UK" "WOKINGHAM RG412YH UK"
## [13] "WORRINGHAM, ZIMBABWE" "YOKOHAMA CITY, JAPAN" NA
State variable seems more standard
## [1] "AA" "AE" "AK" "AL" "AP" "AR" "AS" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "FM"
## [16] "GA" "GU" "HI" "IA" "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MH" "MI"
## [31] "MN" "MO" "MP" "MS" "MT" "NC" "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK"
## [46] "OR" "PA" "PR" "PW" "RI" "SC" "SD" "TN" "TX" "UM" "UT" "VA" "VI" "VT" "WA"
## [61] "WI" "WV" "WY"
For the Census geocoding, addresses should be formatted in the following fields:
CREATING Address files NPOAddresses_census.rds and PPLAddresses_census.rds
# npo <- readRDS( "Data/2_InputData/NONPROFITS-2014-2019v2.rds" )
npo$input_address <- paste(npo$Address, npo$City, npo$State, npo$Zip, sep = ", ") #creating an input_address field to match the geocode dataframes
npo <- npo[,c(1,73,12:15,71)]
npo$ID <- 0
npo <- unique(npo)
npo <- npo[order(npo$input_address),]
npo$ID <- 1:nrow(npo)
rownames(npo) <- NULL
saveRDS(npo, "Data/3_GeoCensus/NPOAddresses_census.rds")
# ppl <- readRDS( "Data/2_InputData/PEOPLE-2014-2019v2.rds" )
ppl$input_address <- paste(ppl$Address, ppl$City, ppl$State, ppl$Zip, sep = ", ") #creating an input_address field to match the geocode dataframes
ppl <- ppl[,c(1,21,11:14,19)]
ppl$ID <- 0
ppl <- unique(ppl)
ppl <- ppl[order(ppl$input_address),]
ppl$ID <- 1:nrow(ppl)
rownames(ppl) <- NULL
saveRDS(ppl, "Data/3_GeoCensus/pplAddresses_census.rds")