In this script we will work with the NONPROFIT-2014-2019.rds and PEOPLE-2014-2019.rds files to:

Identifying Addresses that are PO Boxes. POBs will be geocoded through their zips or city centroids
Exploring data gaps in the addresses data
Creating an input_address variable NEED TO UPDATE THIS IN THE SCRIPT

New datasets will be stored as NONPROFIT-2014-2019v2.rds and PEOPLE-2014-2019v2.rds. In addition address files NPOAddresses_census.rds and PPLAddresses_census.rds will be created to pass them through the Census Geocoding service.

Packages

library( dplyr )
library(tidyverse)
library( tidyr )
library( pander )
library( httr )
library( stringr )

#update the path with your working directory:
wd <- "/Users/icps86/Dropbox/R Projects/Open_data_ignacio"
setwd(wd) 
dir()

##  [1] "_Archive"                       "Data"                          
##  [3] "Data2"                          "Documentation"                 
##  [5] "GeoProcessReport.html"          "GeoProcessReport.Rmd"          
##  [7] "Icon\r"                         "Images"                        
##  [9] "Research Note"                  "Step-01-ProcessRawData.html"   
## [11] "Step-01-ProcessRawData.Rmd"     "Step-02-POBsandGaps.html"      
## [13] "Step-02-POBsandGaps.Rmd"        "Step-03-CensusGeo.Rmd"         
## [15] "Step-03-fix.Rmd"                "Step-03.1-CensusGeo.html"      
## [17] "Step-03.2-CompilingResults.Rmd" "Step-03.3-SecondCensusGeo.Rmd" 
## [19] "Step-04-GoogleGeo.Rmd"          "Step-05-ZipsandCityGeo.Rmd"    
## [21] "Step-06-Disambiguation.Rmd"     "StepsTable.fld"                
## [23] "StepsTable.html"                "zip-codes"

1. Identifying PO Boxes

We will be looking for POB (Post Office Boxes). Might also find APOs (Army Post Office), FPOs (Fleet Post Office) and DPOs (Diplomatic Post Office).

1.1 Developing a regular expression code to identify PO Boxes

1.1 Identifying POBs in NPO dataset

Loading file and creating a POB variable to flag addresses that are pobs.

#loading data
npo <- readRDS("Data/2_InputData/NONPROFITS-2014-2019.rds")       

#creating a variable for pob flags
npo$pob <- NA

Identifying the pobs using a regular expression code

x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", npo$Address,ignore.case=TRUE)
pob <- npo$Address[which(x)]
head(pob, 5)

## [1] "P O BOX 284" "PO BOX 1111" "PO BOX 261"  "PO BOX 98"   "PO BOX 196"

Addressess identified as POBs are sometimes very lengthy…

#including a character length var
pob.len <- nchar(pob)
pobs <- tibble(pob,pob.len)

#arranging it
pobs <- arrange(pobs, desc(pob.len))
head(pobs)

#plotting
x <- table(pobs$pob.len)
barplot(x)

pander(x[3:11])

7	8	9	10	11	12	13	14	15
155	452	2197	10804	8907	4277	2838	446	60

Most NPO POB addresses are clusterd around the 11 character length.

After taking a closer look, we notice that lengthy POB addresses have a mix of both POB info and a regular address.

Take a look at POBs with 25 characters

head(pobs[pobs$pob.len==25,], 5)

A POB flag and address character length var will be added to the actual dataset

#char length
npo$add.len <- nchar(npo$Address)

#pob flag
x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", npo$Address,ignore.case=TRUE)
npo$pob <- as.numeric(x) #adding a flag for pobs identified

How many pobs?

x <- as.data.frame(prop.table(table(npo$pob)))
names(x) <- c("pob","percent")
x$percent <- paste0(round(x$percent*100,1)," %")
pander(x)

pob	percent
0	87.6 %
1	12.4 %

We will merge the ADDRESS, CITY, STATE and ZIP variable to create an INPUT ADDRESS var in the NPO dataset for geocoding.

#creating an Input Address variable
npo$input_address <- paste(npo$Address, npo$City, npo$State, npo$Zip, sep = ", ")

Saving the npo file

saveRDS(npo, "Data/2_InputData/NONPROFITS-2014-2019v2.rds")

1.2 Identifying POBs in BOARD MEMBERS dataset

Loading file and creating a pob variable to flag addresses that are pobs.

#loading data
ppl <- readRDS("Data/2_InputData/PEOPLE-2014-2019.rds")   

#creating a variable for pob flags
ppl$pob <- NA

Identifying the pobs using a regular expression code

x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", ppl$Address,ignore.case=TRUE)
pob <- ppl$Address[which(x)]
head(pob, 5)

## [1] "P O BOX 1742" "RR2 BOX 6464" "PO BOX 7"     "PO BOX 7"     "PO BOX 7"

Similar to the NPO addresses, some POBs are mixedwith regular address info:

pob.len <- nchar(pob)
pobs <- tibble(pob,pob.len)
pobs <- arrange(pobs, desc(pob.len))
head(pobs)

#plotting
x <- table(pobs$pob.len)
barplot(x)

pander(x[4:12])

7	8	9	10	11	12	13	14	15
422	834	3801	19002	14801	6833	5138	826	160

Most ppl POB addresses are also clusterd around the 11 character length.

Adding POB flag and add.len var

ppl$add.len <- nchar(ppl$Address)

x <- grepl("Post Office|POST OFFICE BOX|PO BOX|POBOX|BOX\\s*\\d|POB\\s*\\d|CPO\\s*\\d|PO B\\s*\\d|^PO\\s*\\d+$", ppl$Address,ignore.case=TRUE)
ppl$pob <- as.numeric(x) #adding a flag for pobs identified

How many pobs?

x <- as.data.frame(prop.table(table(ppl$pob)))
names(x) <- c("pob","percent")
x$percent <- paste0(round(x$percent*100,1)," %")
pander(x)

pob	percent
0	94.1 %
1	5.9 %

We will merge the ADDRESS, CITY, STATE and ZIP variables to create an INPUT ADDRESS var in the ppl dataset for geocoding.

#creating an Input Address variable
ppl$input_address <- paste(ppl$Address, ppl$City, ppl$State, ppl$Zip, sep = ", ") #creating an input_address field to match the geocode dataframes

Saving rds

saveRDS(ppl, "Data/2_InputData/PEOPLE-2014-2019v2.rds")

2. Exploring Address data gaps

In this section we will look into the Addresses and try to identify any problem with the data. We will also create an address_input variable which will be used for geocoding.

2.1 NonProfit dataset

The variables that contain address informaiton are:

“Address”
“City”
“State”
“Zip”

Subsetting a smaller file to explore addresses, removing pobs

nmz <- c("ID", "ORGNAME", "Address", "City", "State", "Zip", "pob", "add.len")
add <- npo[,nmz]

#dropping all pobs
x <- add$pob == 0
add <- add[x,]

NPO ADDRESSES have 3 NAs and 0 blanks

Using the Address character length variable (add.len) to see the distribution and determine when an address is short enough to assume it is unintelligible.

x <- table(add$add.len)
barplot(x)

x[13:21]

## 
##    13    14    15    16    17    18    19    20    21 
##  9176 13270 17502 21068 21944 21499 19681 16861 14434

NPO Addresses are clustered around 18 characters

Addresses with length 1-2 are unintelligible and will probably have to be geocoded using zip codes or city centers

x <- which(add$add.len %in% c(1,2))
head(add$Address[x],15)

##  [1] "68" "38" "26" "4"  "68" "E"  "55" "PO" "32" "81" "N"  "13" "79" "91" "24"

Looking at those with length 3-4, they still look unintelligible

x <- which(add$add.len %in% c(3,4))
head(add$Address[x],15)

##  [1] "573"  "332"  "7472" "1377" "7232" "564"  "659"  "463"  "567"  "339" 
## [11] "734"  "782"  "7235" "3815" "6003"

Addresses with length 5-6 also seem to be numbers with not much sense

x <- which(add$add.len %in% c(5,6))
head(add$Address[x],15)

##  [1] "20832"  "66612"  "720038" "330322" "12492"  "11855"  "530591" "490141"
##  [9] "950691" "26957"  "65482"  "11065"  "42368"  "342341" "65745"

However, looking at those with length 7-8, we start to recognize proper addressess.

x <- which(add$add.len %in% c(7,8))
head(add$Address[x],15)

##  [1] "C/O 356"  "P O 4911" "HWY 37 N" "919 BERT" "218 LAWN" "276 KING"
##  [7] "45 TESLA" "45 TESLA" "408 J ST" "301 UCB"  "SUITE 8"  "3 DEODAR"
## [13] "832 30TH" "49 RR 2"  "497 MAIN"

Looking for NAs within the other relevant columns

CITY variable has 11 NAs and 1 blanks

City var has some unintelligible values

x <- as.data.frame(table(add$City, useNA = "always"))
names(x)[1] <- "City"
x <- arrange(x, City)
x$City[1:50]

##  [1] , ANTIOCH            , MONTCLAIR          :AGRANGE            
##  [4] :LITHONIA            `                    07013               
##  [7] 08088-000            0WASSO               1.   PACOIMA        
## [10] 11231                13035 OLIVE BLVD     1474 BERGER DR      
## [13] 20748                2212 SW 55TH         33162               
## [16] 37 WOODS DRIVE       3A                   45434               
## [19] 46TH COURT           48723                50309               
## [22] 50674                54911                63103               
## [25] 63136                76052                77004               
## [28] 818 FARMERS UNION RD 93420                A                   
## [31] AARONSBURG           ABBEVILLE            ABBOTTSTOWN         
## [34] ABEDEEEN             ABERDEEN             ABERNATHY           
## [37] ABIENE               ABILENE              ABINGDON            
## [40] ABINGTON             ABIQUIU              ABITA SPRINGS       
## [43] ABRAMS               ABSAROKEE            ABSECON             
## [46] ACAMPO               ACCIDENT             ACCOKEEK            
## [49] ACCORD               ACKERLY             
## 14515 Levels: , ANTIOCH , MONTCLAIR :AGRANGE :LITHONIA ` 07013 ... ZWOLLE

And some values that seem to be too short or too long

x$City <- as.character(x$City)
x$len <- nchar(x$City)
barplot(table(x$len))

x <- arrange(x, len)
head(x$City,20)

##  [1] "`"  "A"  "B"  "C"  "D"  "E"  "F"  "G"  "H"  "I"  "M"  "N"  "O"  "Q"  "R" 
## [16] "S"  "W"  "3A" "AR" "AS"

tail(x$City,15)

##  [1] "SOUTHEAST WASHINGTON" "SPOTSYLVANIA COURTHO" "SUMMERVILLE / LINCOL"
##  [4] "THOUSAND ISLAND PARK" "TRUTH OR CONSEQUENCE" "UNION MILLSUNION MIL"
##  [7] "VIRGINVIRGINIA BEACH" "WARRENSVILLE HEIGHTS" "WASHINGTON COURT HOU"
## [10] "WASHINGTON COURTHOUS" "WESLEY HILLS, NY 109" "WEST CHESTER TOWNSHI"
## [13] "WHITE RIVER JUNCTION" "WHITE SULPHUR SPRING" NA

STATE variable has 47 NAs and 1

State variable seems more standard

sort(unique(add$State))

##  [1] "AA" "AE" "AK" "AL" "AP" "AR" "AS" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "FM"
## [16] "GA" "GU" "HI" "IA" "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MH" "MI"
## [31] "MN" "MO" "MP" "MS" "MT" "NC" "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK"
## [46] "OR" "PA" "PR" "PW" "RI" "SC" "SD" "TN" "TX" "UM" "UT" "VA" "VI" "VT" "WA"
## [61] "WI" "WV" "WY"

ZIP variable has 0 NAs and 0

2.2 Board Member dataset

Subsetting a smaller file, and removing pobs

nmz <- c("ID", "ORGNAME", "Address", "City", "State", "Zip", "pob", "add.len")
add <- ppl[,nmz]

#dropping all pobs
x <- add$pob == 0
add <- add[x,]

PPL ADDRESSES have 403 NAs and 0 blanks

Using the Address character length variable (add.len) to see the distribution

x <- table(add$add.len)
barplot(x)

x[13:22]

## 
##    13    14    15    16    17    18    19    20    21    22 
## 42360 59642 76075 88122 90147 86933 78423 65886 53984 42045

PPL Addresses are clustered around 18 characters.

Similar to NPO data, shorter addresses are unintelligeble.

x <- which(add$add.len %in% c(1,2))
head(add$Address[x],15)

##  [1] "-"  "E"  "24" "24" "NA" "NA" "NA" "7"  "2"  "NA" "NA" "5"  "NA" "-"  "-"

x <- which(add$add.len %in% c(3,4))
head(add$Address[x],15)

##  [1] "NONE" "2832" "230"  "NONE" "RT 2" "VOID" "1655" "RR 4" "171"  "564" 
## [11] "564"  "564"  "564"  "564"  "NONE"

x <- which(add$add.len %in% c(5,6))
head(add$Address[x],15)

##  [1] "20832"  "20832"  "20832"  "20832"  "14421"  "DUEBER" "12950"  "11022" 
##  [9] "44744"  "ARROYO" "ARROYO" "530591" "530591" "530591" "530591"

And, again, Addresses with length 7-8 are recognizable as proper addressess.

x <- which(add$add.len %in% c(7,8))
head(add$Address[x],20)

##  [1] "SERETARY" "59 WATER" "107 N ST" "3 JORDAN" "SOUTH ST" "GRAND ST"
##  [7] "POLK AVE" "510 HYDE" "512 PARK" "108 POLO" "HEATHER"  "RR3 H-21"
## [13] "RR3 H-21" "8 ALDEN"  "16CR329"  "ELM PARK" "3726 RFD" "5500 IDA"
## [19] "372ND ST" "POBX 387"

Looking for NAs within the other relevant columns

CITY variable has 220 NAs and 0 blanks

City var has some unintelligible values

x <- as.data.frame(table(add$City, useNA = "always"))
names(x)[1] <- "City"
x <- arrange(x, City)
x$City[1:50]

##  [1] -                    --                   , LAKELAND          
##  [4] , LEWISVILLE         , LINCOLNWOOD        , SAN JOSE          
##  [7] , SOUTH OZONE PARK   ,ISSOURI CITY        ,MARLTON            
## [10] ,MARSHALL            ,MOKENA              ;MARION             
## [13] :BARRE               :LAS VEGAS           :LEWISVILLE         
## [16] :LOS ANGELES         . BOULDER            . FULLERTON         
## [19] . PHOENIX            .CLARENCE            'WESTFIELD          
## [22] / LEXINGTON,         /COLORADO SPRINGS    /JONESBORO          
## [25] /SOUTHFIELD          /WADMALAW ISLAND     0                   
## [28] 0750 OSLO NORWAY     0RLANDO              0XFORD              
## [31] 0XNARD               1 NORMAL AVE         1000 VILLA ST       
## [34] 102ND ST             10301                10454               
## [37] 106 SE 7TH AVENUE    1071 SOUTHERN ARTERY 1101 W VICKSBURG PL 
## [40] 11019 SLATER AVE     1107 FAIR OAKS AVE N 1109 WALNUT CIRCLE  
## [43] 1111111              1112 WESTMINSTER AVE 11201               
## [46] 11370 SE 33RD AVE    1155 STEVEN DR       116 31 STOCKHOLM SWE
## [49] 11701 NORTH WEST 12T 1205                
## 26935 Levels: - -- , LAKELAND , LEWISVILLE , LINCOLNWOOD ... ZWOLLE

And some values that seem to be too short or too long

x$City <- as.character(x$City)
x$len <- nchar(x$City)
barplot(table(x$len))

x <- arrange(x, len)
head(x$City,20)

##  [1] "-" "0" "2" "A" "b" "B" "C" "D" "E" "F" "G" "H" "J" "K" "L" "M" "N" "O" "P"
## [20] "Q"

tail(x$City,15)

##  [1] "WASHINGTON COURTHOUS" "WEST CHESTER, PA 193" "WEST PALM BEACH, FLO"
##  [4] "WEST PALWEST M BEACH" "WESTMORELAND JAMAICA" "WHITE RIVER JUNCTION"
##  [7] "WHITE SULFUR SPRINGS" "WHITE SULPHUR SPRING" "WILLOUGHBY HILLS APT"
## [10] "WILMINGTON, DE 19801" "WOKINGHAM RG402DH UK" "WOKINGHAM RG412YH UK"
## [13] "WORRINGHAM, ZIMBABWE" "YOKOHAMA CITY, JAPAN" NA

STATE variable has 1484 NAs and 0 blanks

State variable seems more standard

sort(unique(add$State))

##  [1] "AA" "AE" "AK" "AL" "AP" "AR" "AS" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "FM"
## [16] "GA" "GU" "HI" "IA" "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MH" "MI"
## [31] "MN" "MO" "MP" "MS" "MT" "NC" "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK"
## [46] "OR" "PA" "PR" "PW" "RI" "SC" "SD" "TN" "TX" "UM" "UT" "VA" "VI" "VT" "WA"
## [61] "WI" "WV" "WY"

ZIP variable has 3024 NAs and 0 blanks

3. Creating Addresses files to pass through the Census Geocode service

For the Census geocoding, addresses should be formatted in the following fields:

Unique ID,
House Number and Street Name,
City,
State,
ZIP Code

CREATING Address files NPOAddresses_census.rds and PPLAddresses_census.rds

3.1 Creating a NPO input_address dataset

# npo <- readRDS( "Data/2_InputData/NONPROFITS-2014-2019v2.rds" )

npo$input_address <- paste(npo$Address, npo$City, npo$State, npo$Zip, sep = ", ") #creating an input_address field to match the geocode dataframes
npo <- npo[,c(1,73,12:15,71)]
npo$ID <- 0
npo <- unique(npo)
npo <- npo[order(npo$input_address),]
npo$ID <- 1:nrow(npo)
rownames(npo) <- NULL

saveRDS(npo, "Data/3_GeoCensus/NPOAddresses_census.rds")

3.2 Creating a PPL input_address dataset

# ppl <- readRDS( "Data/2_InputData/PEOPLE-2014-2019v2.rds" )

ppl$input_address <- paste(ppl$Address, ppl$City, ppl$State, ppl$Zip, sep = ", ") #creating an input_address field to match the geocode dataframes

ppl <- ppl[,c(1,21,11:14,19)]
ppl$ID <- 0
ppl <- unique(ppl)

ppl <- ppl[order(ppl$input_address),]
ppl$ID <- 1:nrow(ppl)
rownames(ppl) <- NULL

saveRDS(ppl, "Data/3_GeoCensus/pplAddresses_census.rds")

Step 2. Identifying POBs and checking data

3/19/2019