Introduction

PENDING:

  • Fix results files
  • provide intro to census service and info on added variables and results (no match exact non exact, tie, etc.)
  • Include troubleshooting section with examples
  • what happened to the addresses that are “—-”? and what happened to the NPO IDs that are not unique?

In this script we will geocode NPO and PPL addresses using the Census Bureau geocoding service. The new dataset will be saved as a new version of the main files NONPROFITS-2014-2019v3.rds and PEOPLE-2014-2019v3.rds.

Steps

  1. We will load the NONPROFIT-2014-2019v2.rds and PEOPLE-2014-2019v2.rds files and produce input files NPOAddresses_census.rds and PPLAddresses_census.rds. These hold the data that will be passed through the geocoding service.
  2. Intro and demo of the Census geocoding service.
  3. Geocoding NPO addressess (NPOAddresses_census.rds) through the Census geocoding service. The script will yield raw output file NPOAddresses_censusGEO.rds. The new geocode information will be integrated into a new version of the main file NONPROFIT-2014-2019v3.rds
  4. Geocoding PPL addressess (PPLAddresses_census.rds) through the Census geocoding service. The script will yield raw output file PPLAddresses_censusGEO.rds. The new geocode information will be integrated into a new version of the main file PEOPLE-2014-2019v3.rds
  5. Troubleshooting

Notes * This script includes a troubleshooting section. * Geocoding can take several hours, for this reason some code chunks in this script are not evaluated. Outputs yielded from the process are loaded from stored files to ilustrate the results.

Packages

2. Demo: The Census Geocoding Service

-ADD BRIEF DESCRIPTION-

Additional information about the geocoding service can be found here:

This section runs a Demo to test the code is working. Addresses should be formatted in the following fields:

  • Unique ID,
  • House Number and Street Name,
  • City,
  • State,
  • ZIP Code

Geocode adds the following variables to the dataset:

  • match
  • match_type
  • out_address
  • lat_lon
  • tiger_line_id
  • tiger_line_side
  • state_fips
  • county_fips
  • tract_fips
  • block_fips
  • lon
  • lat

Geocode outputs from the Census can either be:

  • Match (Exact/Non_Exact): can be exact or approximate? irght? WHAT DOES THIS MEAN?
  • Tie:
  • No_Match:

Loading the Board Members Dataset (PPL) dataset and subsetting only key variables to test the demo

In the following code chunk we are executing the geocode query and storing results as csv files.

Results look like this:

Table continues below
id input_address match match_type
ID-2014-010468034-02 60 COMMUNITY DRIVE, AUGUSTA, ME, 04330 Match Exact
ID-2014-010468034-01 101 MAIN STREET, ROCKPORT, ME, 04856 Match Exact
ID-2014-010278788-03 P O BOX 1742, SACO, ME, 04072 No_Match
ID-2014-010278788-02 550 SOUTH WATERBORO ROAD, LYMAN, ME, 04002 Match Non_Exact
ID-2014-010512631-01 216 HUFFS MILL ROAD, BOWDOIN, ME, 04287 Match Exact
ID-2014-010278788-01 36 LORDS LANE, LYMAN, ME, 04002 Match Exact
Table continues below
out_address lat_lon tiger_line_id
60 Community Dr, AUGUSTA, ME, 04330 -69.79748,44.341183 75474285
101 Main St, ROCKPORT, ME, 04856 -69.0846,44.193134 75566531
550 E Waterboro Rd, LYMAN, ME, 04002 -70.65294,43.530483 92794668
216 Huffs Mill Rd, BOWDOIN, ME, 04287 -69.920235,44.107376 75613795
36 Lords Ln, LYMAN, ME, 04002 -70.62657,43.489647 92739980
Table continues below
tiger_line_side state_fips county_fips tract_fips block_fips
L 23 011 010200 2017
R 23 013 970500 1004
R 23 031 024500 2025
R 23 023 970200 3007
L 23 031 024500 3014
lon lat
-69.79748 44.341183
-69.0846 44.193134
NA NA
-70.65294 43.530483
-69.920235 44.107376
-70.62657 43.489647

3. Geocoding Nonprofit (NPO) addresses

Uploading the file NPOAddresses_census.rds

3.2 Geocoding NPO addresses

The following code chunk will pass the addresses files produced through the census geocode srvice. This will output the following per each address file:

  • Rresults[i].csv, with raw geocode results and
  • RresultsNPO[i].csv, wich has the same data but with splitted lat/lon fields.

In addition, a Geocode_Log.txt will be created for the whole process.

Note: Geocoding this amount of addresses will take significant hours. From our experience geocoding 1000 addresess took XXXX. ADD WHAT TO DO IF PROCESS HALTS IN THE MIDDLE

# create a folder for npo census geocoding files if needed:
# dir.create( "Data/3_GeoCensus/addresses_npo/2014-2019/Results" )

# setting wd
wd2 <- paste0(wd, "/Data/3_GeoCensus/addresses_npo/2014-2019")
setwd(wd2)

# producing LOG file
log <- c("Query_Number", "Start_time", "Time_taken")
log <- paste(log, collapse=',')
log.name <- as.character(Sys.time())
log.name <- gsub(":","-",log.name)
log.name <- gsub(" ","-",log.name)
log.name <- paste0("Results/Geocode_Log_",log.name,".txt")
write(log, file=log.name, append = F)

#Geocoding loop:
for( i in 1:loops )
  { 
  #creating the objects used in each iteration of the loop: file name and api
  addressFile <- paste0( "AddressNPO",i,".csv" ) 
  apiurl <- "https://geocoding.geo.census.gov/geocoder/geographies/addressbatch"

  #outputs in console to track loop
  print( i )
  print(Sys.time() )
  start_time <- Sys.time()
  
  #Geocode query for i. Query is wrapped with try function to allow error-recovery
  try( 
    resp <- POST( apiurl, 
                  body=list(addressFile=upload_file(addressFile),
                            benchmark="Public_AR_Census2010",
                            vintage="Census2010_Census2010",
                            returntype="csv" ), 
                            encode="multipart" )
    )

  #documenting ending times
  end_time <- Sys.time()
  print( end_time - start_time ) #ouputting in R console

  #writing a line in the log file after query i ends
  query <- as.character(i)
  len <- as.character(end_time - start_time)
  start_time <- as.character(start_time)
  log <- c(query, start_time, len)
  log <- paste(log, collapse=',')
  write(log, file=log.name, append = T)

  #constructing the Results[i].csv filename.
  addressFile2 <- paste0( "Results/Results",i,".csv" ) 

  #creating column names to include in the results csv file
  var_names <- c( "id", "input_address",
                  "match", "match_type", 
                  "out_address", "lat_lon", 
                  "tiger_line_id", "tiger_line_side", 
                  "state_fips", "county_fips", 
                 "tract_fips", "block_fips" )
  v.names <- paste(var_names, collapse=',')
  
  #writing Rresults[i].csv, including headers
  writeLines( text=c(v.names, content(resp)) , con=addressFile2 )

  #reading file
  res <- read.csv( addressFile2, header=T, 
                 stringsAsFactors=F, 
                 colClasses="character" )

  # Splitting latitude and longitude values from results (res) to a variable (lat.lon)
  lat.lon <- strsplit( res$lat_lon, "," )
  
  #adding NAs to lat.lon empty fields
  for( j in 1:length(lat.lon) )
    {
    if( length( lat.lon[[j]] ) < 2 )
      lat.lon[[ j ]] <- c(NA,NA)
    }
  
  #tranforming the splitted lat.lons to columns that can be binded to a dataframe
  m <- matrix( unlist( lat.lon ), ncol=2, byrow=T )
  colnames(m) <- c("lon","lat")
  m <- as.data.frame( m )
  
  #Adding lat and lon values to raw results and writing ResultsNpo[i].csv file
  res <- cbind( res, m )
  write.csv( res, paste0("Results/ResultsNPO",i,".csv"), row.names=F )
  
  } # end of loop

# setting back wd
setwd(wd)

3.5 Exploring NPO Geocode Results

Lets take a look at the geolocations of our Nonprofits:

Summary of geocode (all):

There are 263272 NPO listed. with 256527 unique addresses

  frequency percent
Match 182042 69.15
No_Match 62215 23.63
Tie 3380 1.284
NA. 15635 5.939
TOTAL 263272 100

The following numbers of POBs

Non-POB POB
87.6 12.4

Summary of geocode excluding POBs:

  frequency percent
Match 180999 78.5
No_Match 32866 14.25
Tie 3302 1.432
NA. 13409 5.815
TOTAL 230576 100

Summary of geocode for only POBs:

  frequency percent
Match 1043 3.19
No_Match 29349 89.76
Tie 78 0.2386
NA. 2226 6.808
TOTAL 32696 100

4. Geocoding Board Member (PPL) addresses

Uploading the file PPLAddresses_census.rds

4.2 Geocoding PPL addresses

Passing the addresses files produced through the census geocode srvice. This will output the following per each address file:

  • Rresults[i].csv, with raw geocode results and
  • RresultsPPL[i].csv, wich has the same data but with splitted lat/lon fields.

In addition, a Geocode_Log.txt will be created for the whole process.

Note: Geocoding this amount of addresses will take significant hours. From our experience geocoding 1000 addresess took XXXX. ADD WHAT TO DO IF PROCESS HALTS IN THE MIDDLE

# create a folder for ppl census geocoding files if needed:
# dir.create( "Data/3_GeoCensus/addresses_ppl/2014-2019/Results" )

# setting wd
wd2 <- paste0(wd, "/Data/3_GeoCensus/addresses_ppl/2014-2019")
setwd(wd2)

# producing LOG file
log <- c("Query_Number", "Start_time", "Time_taken")
log <- paste(log, collapse=',')
log.name <- as.character(Sys.time())
log.name <- gsub(":","-",log.name)
log.name <- gsub(" ","-",log.name)
log.name <- paste0("Results/Geocode_Log_",log.name,".txt")
write(log, file=log.name, append = F)

#Geocoding loop:
for( i in 1:loops )
  { 
  #creating the objects that will be used in each iteration of the loop: file name of addresses and api
  addressFile <- paste0( "AddressPPL",i,".csv" ) 
  apiurl <- "https://geocoding.geo.census.gov/geocoder/geographies/addressbatch"

  #outputs in console to track loop
  print( i )
  print(Sys.time() )
  start_time <- Sys.time()
  
  #Geocode query for i. Query is wrapped with try function to allow error-recovery
  try( 
    resp <- POST( apiurl, 
                  body=list(addressFile=upload_file(addressFile),
                            benchmark="Public_AR_Census2010",
                            vintage="Census2010_Census2010",
                            returntype="csv" ), 
                            encode="multipart" )
    )

  #documenting ending times
  end_time <- Sys.time()
  print( end_time - start_time ) #ouputting in R console

  #writing a line in the log file after query i ends
  query <- as.character(i)
  len <- as.character(end_time - start_time)
  start_time <- as.character(start_time)
  log <- c(query, start_time, len)
  log <- paste(log, collapse=',')
  write(log, file=log.name, append = T)

  #constructing the Results[i].csv filename.
  addressFile2 <- paste0( "Results/Results",i,".csv" ) 

  #creating column names to include in the results csv file
  var_names <- c( "id", "input_address",
                  "match", "match_type", 
                  "out_address", "lat_lon", 
                  "tiger_line_id", "tiger_line_side", 
                  "state_fips", "county_fips", 
                 "tract_fips", "block_fips" )
  v.names <- paste(var_names, collapse=',')
  
  #writing Rresults[i].csv, including headers
  writeLines( text=c(v.names, content(resp)) , con=addressFile2 )

  #reading file
  res <- read.csv( addressFile2, header=T, 
                 stringsAsFactors=F, 
                 colClasses="character" )

  # Splitting latitude and longitude values from results (res) to a variable (lat.lon)
  lat.lon <- strsplit( res$lat_lon, "," )
  
  #adding NAs to lat.lon empty fields
  for( j in 1:length(lat.lon) )
    {
    if( length( lat.lon[[j]] ) < 2 )
      lat.lon[[ j ]] <- c(NA,NA)
    }
  
  #tranforming the splitted lat.lons to columns that can be binded to a dataframe
  m <- matrix( unlist( lat.lon ), ncol=2, byrow=T )
  colnames(m) <- c("lon","lat")
  m <- as.data.frame( m )
  
  #Adding lat and lon values to raw results and writing ResultsPPL[i].csv file
  res <- cbind( res, m )
  write.csv( res, paste0("Results/ResultsPPL",i,".csv"), row.names=F )
  
  } # end of loop

# setting back wd
setwd(wd)

4.5 Exploring PPL Geocode Results

Lets take a look at the geolocations of our Board Members:

Summary of geocoding process

There are 946093 NPO listed. with 729304 unique addresses

  frequency percent
Match 689983 72.93
No_Match 183299 19.37
Tie 11591 1.225
NA. 61220 6.471
TOTAL 946093 100

The following numbers of POBs

Non-POB POB
94.1 5.9

Summary of geocoding process excluding POBs:

  frequency percent
Match 688679 77.33
No_Match 132445 14.87
Tie 11513 1.293
NA. 57933 6.505
TOTAL 890570 100

Summary of geocoding process for only POBs:

  frequency percent
Match 1304 2.349
No_Match 50854 91.59
Tie 78 0.1405
NA. 3287 5.92
TOTAL 55523 100

5. Troubleshooting

In the case a geocode process is aborted before finishing, you might need to geocode the process again. The code below helps to compile all geocode results into one.

ADD examples…