In this script we will work with the NONPROFIT-2014-2019.rds and PEOPLE-2014-2019.rds files to:

  1. Identifying Addresses that are PO Boxes. POBs will be geocoded through their zips or city centroids
  2. Exploring data gaps in the addresses data
  3. Creating an input_address variable NEED TO UPDATE THIS IN THE SCRIPT

New datasets will be stored as NONPROFIT-2014-2019v2.rds and PEOPLE-2014-2019v2.rds. In addition address files NPOAddresses_census.rds and PPLAddresses_census.rds will be created to pass them through the Census Geocoding service.

Packages

##  [1] "_Archive"                       "Data"                          
##  [3] "Data2"                          "Documentation"                 
##  [5] "GeoProcessReport.html"          "GeoProcessReport.Rmd"          
##  [7] "Icon\r"                         "Images"                        
##  [9] "Research Note"                  "Step-01-ProcessRawData.html"   
## [11] "Step-01-ProcessRawData.Rmd"     "Step-02-POBsandGaps.html"      
## [13] "Step-02-POBsandGaps.Rmd"        "Step-03-CensusGeo.Rmd"         
## [15] "Step-03-fix.Rmd"                "Step-03.1-CensusGeo.html"      
## [17] "Step-03.2-CompilingResults.Rmd" "Step-03.3-SecondCensusGeo.Rmd" 
## [19] "Step-04-GoogleGeo.Rmd"          "Step-05-ZipsandCityGeo.Rmd"    
## [21] "Step-06-Disambiguation.Rmd"     "StepsTable.fld"                
## [23] "StepsTable.html"                "zip-codes"

1. Identifying PO Boxes

We will be looking for POB (Post Office Boxes). Might also find APOs (Army Post Office), FPOs (Fleet Post Office) and DPOs (Diplomatic Post Office).

1.1 Developing a regular expression code to identify PO Boxes

1.1 Identifying POBs in NPO dataset

Loading file and creating a POB variable to flag addresses that are pobs.

Identifying the pobs using a regular expression code

## [1] "P O BOX 284" "PO BOX 1111" "PO BOX 261"  "PO BOX 98"   "PO BOX 196"

Addressess identified as POBs are sometimes very lengthy…

7 8 9 10 11 12 13 14 15
155 452 2197 10804 8907 4277 2838 446 60

Most NPO POB addresses are clusterd around the 11 character length.

After taking a closer look, we notice that lengthy POB addresses have a mix of both POB info and a regular address.

Take a look at POBs with 25 characters

A POB flag and address character length var will be added to the actual dataset

How many pobs?

pob percent
0 87.6 %
1 12.4 %

We will merge the ADDRESS, CITY, STATE and ZIP variable to create an INPUT ADDRESS var in the NPO dataset for geocoding.

Saving the npo file

1.2 Identifying POBs in BOARD MEMBERS dataset

Loading file and creating a pob variable to flag addresses that are pobs.

Identifying the pobs using a regular expression code

## [1] "P O BOX 1742" "RR2 BOX 6464" "PO BOX 7"     "PO BOX 7"     "PO BOX 7"

Similar to the NPO addresses, some POBs are mixedwith regular address info:

7 8 9 10 11 12 13 14 15
422 834 3801 19002 14801 6833 5138 826 160

Most ppl POB addresses are also clusterd around the 11 character length.

Adding POB flag and add.len var

How many pobs?

pob percent
0 94.1 %
1 5.9 %

We will merge the ADDRESS, CITY, STATE and ZIP variables to create an INPUT ADDRESS var in the ppl dataset for geocoding.

Saving rds

2. Exploring Address data gaps

In this section we will look into the Addresses and try to identify any problem with the data. We will also create an address_input variable which will be used for geocoding.

2.1 NonProfit dataset

The variables that contain address informaiton are:

  • “Address”
  • “City”
  • “State”
  • “Zip”

Subsetting a smaller file to explore addresses, removing pobs

  • NPO ADDRESSES have 3 NAs and 0 blanks

Using the Address character length variable (add.len) to see the distribution and determine when an address is short enough to assume it is unintelligible.

## 
##    13    14    15    16    17    18    19    20    21 
##  9176 13270 17502 21068 21944 21499 19681 16861 14434

NPO Addresses are clustered around 18 characters

Addresses with length 1-2 are unintelligible and will probably have to be geocoded using zip codes or city centers

##  [1] "68" "38" "26" "4"  "68" "E"  "55" "PO" "32" "81" "N"  "13" "79" "91" "24"

Looking at those with length 3-4, they still look unintelligible

##  [1] "573"  "332"  "7472" "1377" "7232" "564"  "659"  "463"  "567"  "339" 
## [11] "734"  "782"  "7235" "3815" "6003"

Addresses with length 5-6 also seem to be numbers with not much sense

##  [1] "20832"  "66612"  "720038" "330322" "12492"  "11855"  "530591" "490141"
##  [9] "950691" "26957"  "65482"  "11065"  "42368"  "342341" "65745"

However, looking at those with length 7-8, we start to recognize proper addressess.

##  [1] "C/O 356"  "P O 4911" "HWY 37 N" "919 BERT" "218 LAWN" "276 KING"
##  [7] "45 TESLA" "45 TESLA" "408 J ST" "301 UCB"  "SUITE 8"  "3 DEODAR"
## [13] "832 30TH" "49 RR 2"  "497 MAIN"

Looking for NAs within the other relevant columns

  • CITY variable has 11 NAs and 1 blanks

City var has some unintelligible values

##  [1] , ANTIOCH            , MONTCLAIR          :AGRANGE            
##  [4] :LITHONIA            `                    07013               
##  [7] 08088-000            0WASSO               1.   PACOIMA        
## [10] 11231                13035 OLIVE BLVD     1474 BERGER DR      
## [13] 20748                2212 SW 55TH         33162               
## [16] 37 WOODS DRIVE       3A                   45434               
## [19] 46TH COURT           48723                50309               
## [22] 50674                54911                63103               
## [25] 63136                76052                77004               
## [28] 818 FARMERS UNION RD 93420                A                   
## [31] AARONSBURG           ABBEVILLE            ABBOTTSTOWN         
## [34] ABEDEEEN             ABERDEEN             ABERNATHY           
## [37] ABIENE               ABILENE              ABINGDON            
## [40] ABINGTON             ABIQUIU              ABITA SPRINGS       
## [43] ABRAMS               ABSAROKEE            ABSECON             
## [46] ACAMPO               ACCIDENT             ACCOKEEK            
## [49] ACCORD               ACKERLY             
## 14515 Levels: , ANTIOCH , MONTCLAIR :AGRANGE :LITHONIA ` 07013 ... ZWOLLE

And some values that seem to be too short or too long

##  [1] "`"  "A"  "B"  "C"  "D"  "E"  "F"  "G"  "H"  "I"  "M"  "N"  "O"  "Q"  "R" 
## [16] "S"  "W"  "3A" "AR" "AS"
##  [1] "SOUTHEAST WASHINGTON" "SPOTSYLVANIA COURTHO" "SUMMERVILLE / LINCOL"
##  [4] "THOUSAND ISLAND PARK" "TRUTH OR CONSEQUENCE" "UNION MILLSUNION MIL"
##  [7] "VIRGINVIRGINIA BEACH" "WARRENSVILLE HEIGHTS" "WASHINGTON COURT HOU"
## [10] "WASHINGTON COURTHOUS" "WESLEY HILLS, NY 109" "WEST CHESTER TOWNSHI"
## [13] "WHITE RIVER JUNCTION" "WHITE SULPHUR SPRING" NA
  • STATE variable has 47 NAs and 1

State variable seems more standard

##  [1] "AA" "AE" "AK" "AL" "AP" "AR" "AS" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "FM"
## [16] "GA" "GU" "HI" "IA" "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MH" "MI"
## [31] "MN" "MO" "MP" "MS" "MT" "NC" "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK"
## [46] "OR" "PA" "PR" "PW" "RI" "SC" "SD" "TN" "TX" "UM" "UT" "VA" "VI" "VT" "WA"
## [61] "WI" "WV" "WY"
  • ZIP variable has 0 NAs and 0

2.2 Board Member dataset

Subsetting a smaller file, and removing pobs

  • PPL ADDRESSES have 403 NAs and 0 blanks

Using the Address character length variable (add.len) to see the distribution

## 
##    13    14    15    16    17    18    19    20    21    22 
## 42360 59642 76075 88122 90147 86933 78423 65886 53984 42045

PPL Addresses are clustered around 18 characters.

Similar to NPO data, shorter addresses are unintelligeble.

##  [1] "-"  "E"  "24" "24" "NA" "NA" "NA" "7"  "2"  "NA" "NA" "5"  "NA" "-"  "-"
##  [1] "NONE" "2832" "230"  "NONE" "RT 2" "VOID" "1655" "RR 4" "171"  "564" 
## [11] "564"  "564"  "564"  "564"  "NONE"
##  [1] "20832"  "20832"  "20832"  "20832"  "14421"  "DUEBER" "12950"  "11022" 
##  [9] "44744"  "ARROYO" "ARROYO" "530591" "530591" "530591" "530591"

And, again, Addresses with length 7-8 are recognizable as proper addressess.

##  [1] "SERETARY" "59 WATER" "107 N ST" "3 JORDAN" "SOUTH ST" "GRAND ST"
##  [7] "POLK AVE" "510 HYDE" "512 PARK" "108 POLO" "HEATHER"  "RR3 H-21"
## [13] "RR3 H-21" "8 ALDEN"  "16CR329"  "ELM PARK" "3726 RFD" "5500 IDA"
## [19] "372ND ST" "POBX 387"

Looking for NAs within the other relevant columns

  • CITY variable has 220 NAs and 0 blanks

City var has some unintelligible values

##  [1] -                    --                   , LAKELAND          
##  [4] , LEWISVILLE         , LINCOLNWOOD        , SAN JOSE          
##  [7] , SOUTH OZONE PARK   ,ISSOURI CITY        ,MARLTON            
## [10] ,MARSHALL            ,MOKENA              ;MARION             
## [13] :BARRE               :LAS VEGAS           :LEWISVILLE         
## [16] :LOS ANGELES         . BOULDER            . FULLERTON         
## [19] . PHOENIX            .CLARENCE            'WESTFIELD          
## [22] / LEXINGTON,         /COLORADO SPRINGS    /JONESBORO          
## [25] /SOUTHFIELD          /WADMALAW ISLAND     0                   
## [28] 0750 OSLO NORWAY     0RLANDO              0XFORD              
## [31] 0XNARD               1 NORMAL AVE         1000 VILLA ST       
## [34] 102ND ST             10301                10454               
## [37] 106 SE 7TH AVENUE    1071 SOUTHERN ARTERY 1101 W VICKSBURG PL 
## [40] 11019 SLATER AVE     1107 FAIR OAKS AVE N 1109 WALNUT CIRCLE  
## [43] 1111111              1112 WESTMINSTER AVE 11201               
## [46] 11370 SE 33RD AVE    1155 STEVEN DR       116 31 STOCKHOLM SWE
## [49] 11701 NORTH WEST 12T 1205                
## 26935 Levels: - -- , LAKELAND , LEWISVILLE , LINCOLNWOOD ... ZWOLLE

And some values that seem to be too short or too long

##  [1] "-" "0" "2" "A" "b" "B" "C" "D" "E" "F" "G" "H" "J" "K" "L" "M" "N" "O" "P"
## [20] "Q"
##  [1] "WASHINGTON COURTHOUS" "WEST CHESTER, PA 193" "WEST PALM BEACH, FLO"
##  [4] "WEST PALWEST M BEACH" "WESTMORELAND JAMAICA" "WHITE RIVER JUNCTION"
##  [7] "WHITE SULFUR SPRINGS" "WHITE SULPHUR SPRING" "WILLOUGHBY HILLS APT"
## [10] "WILMINGTON, DE 19801" "WOKINGHAM RG402DH UK" "WOKINGHAM RG412YH UK"
## [13] "WORRINGHAM, ZIMBABWE" "YOKOHAMA CITY, JAPAN" NA
  • STATE variable has 1484 NAs and 0 blanks

State variable seems more standard

##  [1] "AA" "AE" "AK" "AL" "AP" "AR" "AS" "AZ" "CA" "CO" "CT" "DC" "DE" "FL" "FM"
## [16] "GA" "GU" "HI" "IA" "ID" "IL" "IN" "KS" "KY" "LA" "MA" "MD" "ME" "MH" "MI"
## [31] "MN" "MO" "MP" "MS" "MT" "NC" "ND" "NE" "NH" "NJ" "NM" "NV" "NY" "OH" "OK"
## [46] "OR" "PA" "PR" "PW" "RI" "SC" "SD" "TN" "TX" "UM" "UT" "VA" "VI" "VT" "WA"
## [61] "WI" "WV" "WY"
  • ZIP variable has 3024 NAs and 0 blanks