Introduction

PENDING:

Fix results files
Provide an intro to the Census geocoding service and info on the added variables and results (No_Match, Exact, Non_Exact, Tie, etc.)
Include troubleshooting section with examples
what happened to the addresses that are “—-”? and what happened to the NPO IDs that are not unique?

In this script we geocode NPO and PPL addresses using the Census Bureau geocoding service. The geocoded data will be saved as new versions of the main files, NONPROFITS-2014-2019v3.rds and PEOPLE-2014-2019v3.rds.

Steps

We will load the NONPROFITS-2014-2019v2.rds and PEOPLE-2014-2019v2.rds files and produce the input files NPOAddresses_census.rds and PPLAddresses_census.rds. These hold the addresses that will be passed through the geocoding service.
Intro and demo of the Census geocoding service.
Geocoding NPO addresses (NPOAddresses_census.rds) through the Census geocoding service. The script yields the raw output file NPOAddresses_censusGEO.rds. The new geocode information is integrated into a new version of the main file, NONPROFITS-2014-2019v3.rds.
Geocoding PPL addresses (PPLAddresses_census.rds) through the Census geocoding service. The script yields the raw output file PPLAddresses_censusGEO.rds. The new geocode information is integrated into a new version of the main file, PEOPLE-2014-2019v3.rds.
Troubleshooting

Notes

This script includes a troubleshooting section.
Geocoding can take several hours. For this reason, some code chunks in this script are not evaluated; outputs from the process are loaded from stored files to illustrate the results.

Packages

library( dplyr )
library( tidyr )
library( pander )
library( httr )

# update the path with your working directory:
wd <- "/Users/icps86/Dropbox/R Projects/Open_data_ignacio"
setwd(wd)

1. Subsetting input files

For the Census geocoding, addresses should be formatted in the following fields:

Unique ID,
House Number and Street Name,
City,
State,
ZIP Code
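As a minimal sketch of this layout, the toy example below builds a five-field address table from made-up records. The data values are hypothetical; only the field order and the deduplicate-then-number logic mirror what this script does with the real files.

```r
# Toy records standing in for the NPO/PPL files (hypothetical values)
toy <- data.frame(
  Address = c("60 COMMUNITY DRIVE", "101 MAIN STREET", "60 COMMUNITY DRIVE"),
  City    = c("AUGUSTA", "ROCKPORT", "AUGUSTA"),
  State   = c("ME", "ME", "ME"),
  Zip     = c("04330", "04856", "04330"),
  stringsAsFactors = FALSE
)

# Keep one row per distinct address, sort, then assign a sequential unique ID
toy <- unique(toy)
toy <- toy[order(toy$Address), ]
toy$ID <- seq_len(nrow(toy))
rownames(toy) <- NULL

# Put the fields in the order the geocoder expects
input <- toy[, c("ID", "Address", "City", "State", "Zip")]
print(input)
```

Keeping Zip as character preserves leading zeros such as 04330.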

CREATING Address files NPOAddresses_census.rds and PPLAddresses_census.rds

1.1 Creating a NPO input_address dataset

npo <- readRDS( "Data/2_InputData/NONPROFITS-2014-2019v2.rds" )

npo$input_address <- paste(npo$Address, npo$City, npo$State, npo$Zip, sep = ", ") #creating an input_address field to match the geocode dataframes
npo <- npo[,c(1,73,12:15,71)]
npo$ID <- 0
npo <- unique(npo)
npo <- npo[order(npo$input_address),]
npo$ID <- 1:nrow(npo)
rownames(npo) <- NULL

saveRDS(npo, "Data/3_GeoCensus/NPOAddresses_census.rds")

1.2 Creating a PPL input_address dataset

ppl <- readRDS( "Data/2_InputData/PEOPLE-2014-2019v2.rds" )

ppl$input_address <- paste(ppl$Address, ppl$City, ppl$State, ppl$Zip, sep = ", ") #creating an input_address field to match the geocode dataframes

ppl <- ppl[,c(1,21,11:14,19)]
ppl$ID <- 0
ppl <- unique(ppl)

ppl <- ppl[order(ppl$input_address),]
ppl$ID <- 1:nrow(ppl)
rownames(ppl) <- NULL

saveRDS(ppl, "Data/3_GeoCensus/PPLAddresses_census.rds")

2. Demo: The Census Geocoding Service

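The Census Geocoder is a free web service from the U.S. Census Bureau that matches street addresses to approximate coordinates and to census geographies (state, county, tract and block). Its batch mode accepts a CSV file of up to 10,000 addresses and returns one result line per address.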

Additional information about the geocoding service can be found here:

DOCUMENTATION: https://www.census.gov/programs-surveys/geography.html
WEB GEOCODING SERVICE: https://geocoding.geo.census.gov/geocoder/geographies/addressbatch?form

This section runs a demo to test that the code works. Addresses should be formatted with the following fields:

Unique ID,
House Number and Street Name,
City,
State,
ZIP Code

The geocoding process adds the following variables to the dataset (lon and lat are derived from lat_lon by this script):

match
match_type
out_address
lat_lon
tiger_line_id
tiger_line_side
state_fips
county_fips
tract_fips
block_fips
lon
lat

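Geocode outputs from the Census can be one of the following:

Match (Exact/Non_Exact): the address was located. Exact means the input address matched a reference address as written; Non_Exact means it matched only after the geocoder adjusted part of it (in the demo results, 550 SOUTH WATERBORO ROAD is matched to 550 E Waterboro Rd).
Tie: the address matched more than one candidate equally well, so no single location is returned.
No_Match: no matching address was found (in the demo results, a P.O. Box address).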
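The outcome categories can be tabulated by combining the match and match_type fields. The sketch below uses a small made-up results table (res_toy is hypothetical), assuming match_type is empty for Tie and No_Match records.

```r
# Hypothetical geocode results: two exact matches, one non-exact, one tie, one no-match
res_toy <- data.frame(
  match      = c("Match", "Match", "No_Match", "Match", "Tie"),
  match_type = c("Exact", "Non_Exact", NA, "Exact", NA),
  stringsAsFactors = FALSE
)

# Combine both fields into a single outcome label
outcome <- ifelse(is.na(res_toy$match_type),
                  res_toy$match,
                  paste(res_toy$match, res_toy$match_type, sep = " / "))

counts <- table(outcome)
print(counts)
```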

Loading the Board Members (PPL) dataset and subsetting only key variables to test the demo

# loading rds
ppl <- readRDS( "Data/3_GeoCensus/PPLAddresses_census.rds")

# create a demo folder if needed:
# dir.create( "Data/3_GeoCensus/demo" )

# subsetting
demo <- select( ppl, ID, Address, City, State, Zip )
x <- sample(nrow(ppl),20, replace = FALSE)
demo <- demo[ x, ]
rownames(demo) <- NULL

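# writing a csv file with the address list to feed into the geocode process
# (no header row: the geocoder treats every line of the file as an address record)
write.table( demo, "Data/3_GeoCensus/demo/TestAddresses.csv", sep=",", row.names=F, col.names=F, qmethod="double" )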

In the following code chunk we are executing the geocode query and storing results as csv files.

# temporarily setting the working directory for geocode
wd2 <- paste0(wd, "/Data/3_GeoCensus/demo")
setwd(wd2)

# creating a url and file path to use in the geocode query
apiurl <- "https://geocoding.geo.census.gov/geocoder/geographies/addressbatch"
addressFile <- "TestAddresses.csv"

# geocode query
resp <- POST( apiurl, 
              body=list(addressFile=upload_file(addressFile), 
                        benchmark="Public_AR_Census2010",
                        vintage="Census2010_Census2010",
                        returntype="csv" ), 
              encode="multipart" )

# Writing results to a csv file using the writeLines function
var_names <- c( "id", "input_address", 
                "match", "match_type", 
                "out_address", "lat_lon", 
                "tiger_line_id", "tiger_line_side", 
                "state_fips", "county_fips", 
                "tract_fips", "block_fips" )
var_names <- paste(var_names, collapse=',')
writeLines( text=c(var_names, content(resp)), con="ResultsDemo.csv" )

# loading the results we wrote
res <- read.csv( "ResultsDemo.csv", header=T, stringsAsFactors=F, colClasses="character" )

# Splitting Latitude and longitude coordinates
lat.lon <- strsplit( res$lat_lon, "," )

for( i in 1:length(lat.lon) )
  {
  if( length( lat.lon[[i]] ) < 2 )
  lat.lon[[ i ]] <- c(NA,NA) 
  }

m <- matrix( unlist( lat.lon ), ncol=2, byrow=T )
colnames(m) <- c("lon","lat")
m <- as.data.frame( m )

res <- cbind( res, m )
head( res )

# writing results with split lat/lon fields to another file
write.csv( res, "ResultsDemo2.csv", row.names=F )

setwd(wd)
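The write-then-read step and the lat/lon split can also be done in memory. The sketch below is an alternative under stated assumptions, not the method used above: body_txt is a hypothetical response body, and it assumes the service returns quoted CSV lines with fewer fields for unmatched records.

```r
var_names <- c("id", "input_address", "match", "match_type",
               "out_address", "lat_lon", "tiger_line_id", "tiger_line_side",
               "state_fips", "county_fips", "tract_fips", "block_fips")

# Hypothetical response body: one matched record and one unmatched record
body_txt <- paste(
  '"1","60 COMMUNITY DRIVE, AUGUSTA, ME, 04330","Match","Exact","60 Community Dr, AUGUSTA, ME, 04330","-69.79748,44.341183","75474285","L","23","011","010200","2017"',
  '"2","P O BOX 1742, SACO, ME, 04072","No_Match"',
  sep = "\n"
)

# Parse the text directly; short lines are padded with empty fields
res_toy <- read.csv(text = body_txt, header = FALSE, col.names = var_names,
                    colClasses = "character", fill = TRUE)

# Split "lon,lat" and return NA when the coordinates are missing
parts <- strsplit(res_toy$lat_lon, ",")
get_part <- function(p, k) if (length(p) == 2) as.numeric(p[k]) else NA_real_
res_toy$lon <- vapply(parts, get_part, numeric(1), k = 1)
res_toy$lat <- vapply(parts, get_part, numeric(1), k = 2)

print(res_toy[, c("id", "match", "lon", "lat")])
```

Reading everything as character keeps leading zeros in the FIPS codes, while lon and lat are converted to numeric explicitly.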

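Results look like this (the No_Match record has no output address, coordinates or FIPS codes, so it does not appear in the second and third table segments):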

res <- read.csv( "Data/3_GeoCensus/demo/ResultsDemo2.csv", header=T, stringsAsFactors=F, colClasses="character" )
pander(head(res))

Table continues below
id	input_address	match	match_type
ID-2014-010468034-02	60 COMMUNITY DRIVE, AUGUSTA, ME, 04330	Match	Exact
ID-2014-010468034-01	101 MAIN STREET, ROCKPORT, ME, 04856	Match	Exact
ID-2014-010278788-03	P O BOX 1742, SACO, ME, 04072	No_Match
ID-2014-010278788-02	550 SOUTH WATERBORO ROAD, LYMAN, ME, 04002	Match	Non_Exact
ID-2014-010512631-01	216 HUFFS MILL ROAD, BOWDOIN, ME, 04287	Match	Exact
ID-2014-010278788-01	36 LORDS LANE, LYMAN, ME, 04002	Match	Exact

Table continues below
out_address	lat_lon	tiger_line_id
60 Community Dr, AUGUSTA, ME, 04330	-69.79748,44.341183	75474285
101 Main St, ROCKPORT, ME, 04856	-69.0846,44.193134	75566531
550 E Waterboro Rd, LYMAN, ME, 04002	-70.65294,43.530483	92794668
216 Huffs Mill Rd, BOWDOIN, ME, 04287	-69.920235,44.107376	75613795
36 Lords Ln, LYMAN, ME, 04002	-70.62657,43.489647	92739980

Table continues below
tiger_line_side	state_fips	county_fips	tract_fips	block_fips
L	23	011	010200	2017
R	23	013	970500	1004
R	23	031	024500	2025
R	23	023	970200	3007
L	23	031	024500	3014

lon	lat
-69.79748	44.341183
-69.0846	44.193134
NA	NA
-70.65294	43.530483
-69.920235	44.107376
-70.62657	43.489647

3. Geocoding Nonprofit (NPO) addresses

Loading the file NPOAddresses_census.rds

npo <- readRDS( "Data/3_GeoCensus/NPOAddresses_census.rds" )

3.1 Dividing the addresses into batches of 500 each

# Create the folders to hold geocoding results if needed:
# dir.create( "Data/3_GeoCensus/addresses_npo")
# dir.create( "Data/3_GeoCensus/addresses_npo/2014-2019")

#setting wd 
wd2 <- paste0(wd, "/Data/3_GeoCensus/addresses_npo/2014-2019")
setwd(wd2)

# Selecting only essential variables
npo <- select( npo, ID, Address, City, State, Zip )

names( npo ) <- NULL  # we remove colnames because the input file should not have a header

# Splitting the addresses into files with 500 addresses each
loops <- ceiling( nrow( npo ) / 500 ) # ceiling() rounds up to the next integer, so loops is the number of batches of 500, counting a final partial batch.

# loop to write the addresses in batches of 500
for( i in 1:loops )
  {
  filename <- paste0( "AddressNPO",i,".csv" )
  start.row <- ((i-1)*500+1) # i starts at 1 and this outputs: 1, 501, 1001, etc.
  end.row <- (500*i) # this outputs 500, 1000, etc.
  if( nrow(npo) < end.row ){ end.row <- nrow(npo) } # the last batch ends at the last row

  # writing the batch to a csv address file, without a header row
  write.table( npo[ start.row:end.row, ], filename, sep=",", row.names=F, col.names=F, qmethod="double" )

  # output to help keep track of the loop.
  print( i )
  print( paste( "Start Row:", start.row ) )
  print( paste( "End Row:", end.row ) )
} # end of loop.

setwd(wd)
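To make the batch arithmetic concrete, the sketch below computes the batch boundaries for a hypothetical file of 1234 addresses (n_rows is made up for illustration).

```r
n_rows     <- 1234  # hypothetical number of addresses
batch_size <- 500

# Batch number of every row: rows 1-500 -> 1, rows 501-1000 -> 2, ...
batch_id <- ceiling(seq_len(n_rows) / batch_size)
loops    <- max(batch_id)
sizes    <- as.vector(table(batch_id))

# First and last row of each batch; the last batch is capped at n_rows
start_rows <- (seq_len(loops) - 1) * batch_size + 1
end_rows   <- pmin(seq_len(loops) * batch_size, n_rows)

print(data.frame(batch = seq_len(loops), start_rows, end_rows, sizes))
```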

3.2 Geocoding NPO addresses

The following code chunk passes the address files produced in the previous step through the Census geocoding service. For each address file it outputs:

Results[i].csv, with the raw geocode results, and
ResultsNPO[i].csv, which has the same data plus separate lon and lat fields.

In addition, a log file, Geocode_Log_[timestamp].txt, is created in the Results folder for the whole process.

Note: Geocoding this many addresses takes several hours. From our experience, geocoding 1000 addresses took XXXX. PENDING: add what to do if the process halts in the middle.
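One possible way to recover from an interrupted run is to geocode only the batches that do not yet have a results file. The sketch below assumes the ResultsNPO[i].csv naming used in this script; pending_batches is a hypothetical helper, and the demonstration runs in a temporary folder rather than the project folders.

```r
# Hypothetical helper: return the batch numbers that still lack a results file
pending_batches <- function(loops, results_dir, prefix = "ResultsNPO") {
  expected <- file.path(results_dir, paste0(prefix, seq_len(loops), ".csv"))
  which(!file.exists(expected))
}

# Demonstration: pretend batches 1 and 2 of 4 finished before the run stopped
results_dir <- file.path(tempdir(), "Results_demo")
dir.create(results_dir, showWarnings = FALSE)
file.create(file.path(results_dir, c("ResultsNPO1.csv", "ResultsNPO2.csv")))

todo <- pending_batches(loops = 4, results_dir = results_dir)
print(todo)
```

The geocoding loop could then iterate over todo instead of 1:loops, so finished batches are not sent again.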

# create a folder for npo census geocoding files if needed:
# dir.create( "Data/3_GeoCensus/addresses_npo/2014-2019/Results" )

# setting wd
wd2 <- paste0(wd, "/Data/3_GeoCensus/addresses_npo/2014-2019")
setwd(wd2)

# producing LOG file
log <- c("Query_Number", "Start_time", "Time_taken")
log <- paste(log, collapse=',')
log.name <- as.character(Sys.time())
log.name <- gsub(":","-",log.name)
log.name <- gsub(" ","-",log.name)
log.name <- paste0("Results/Geocode_Log_",log.name,".txt")
write(log, file=log.name, append = F)

#Geocoding loop:
for( i in 1:loops )
  { 
  #creating the objects used in each iteration of the loop: file name and api
  addressFile <- paste0( "AddressNPO",i,".csv" ) 
  apiurl <- "https://geocoding.geo.census.gov/geocoder/geographies/addressbatch"

  #outputs in console to track loop
  print( i )
  print(Sys.time() )
  start_time <- Sys.time()
  
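  #Geocode query for i. The query is wrapped in try() so that a failed request does not stop the loop
  resp <- NULL # reset so that a failed query cannot reuse the response from the previous batch
  try( 
    resp <- POST( apiurl, 
                  body=list(addressFile=upload_file(addressFile),
                            benchmark="Public_AR_Census2010",
                            vintage="Census2010_Census2010",
                            returntype="csv" ), 
                  encode="multipart" )
    )

  #documenting ending times
  end_time <- Sys.time()
  print( end_time - start_time ) #outputting in R console

  #writing a line in the log file after query i ends (time taken is recorded in seconds)
  query <- as.character(i)
  len <- as.character( as.numeric( difftime( end_time, start_time, units="secs" ) ) )
  start_time <- as.character(start_time)
  log <- c(query, start_time, len)
  log <- paste(log, collapse=',')
  write(log, file=log.name, append = T)

  #if the query failed, skip this batch so that it can be rerun later
  if( is.null(resp) || http_error(resp) )
    {
    print( paste( "Query failed for batch", i ) )
    next
    }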

  #constructing the Results[i].csv filename.
  addressFile2 <- paste0( "Results/Results",i,".csv" ) 

  #creating column names to include in the results csv file
  var_names <- c( "id", "input_address",
                  "match", "match_type", 
                  "out_address", "lat_lon", 
                  "tiger_line_id", "tiger_line_side", 
                  "state_fips", "county_fips", 
                 "tract_fips", "block_fips" )
  v.names <- paste(var_names, collapse=',')
  
  #writing Results[i].csv, including headers
  writeLines( text=c(v.names, content(resp)) , con=addressFile2 )

  #reading file
  res <- read.csv( addressFile2, header=T, 
                 stringsAsFactors=F, 
                 colClasses="character" )

  # Splitting latitude and longitude values from results (res) to a variable (lat.lon)
  lat.lon <- strsplit( res$lat_lon, "," )
  
  #adding NAs to lat.lon empty fields
  for( j in 1:length(lat.lon) )
    {
    if( length( lat.lon[[j]] ) < 2 )
      lat.lon[[ j ]] <- c(NA,NA)
    }
  
  #transforming the split lat/lon values into columns that can be bound to a data frame
  m <- matrix( unlist( lat.lon ), ncol=2, byrow=T )
  colnames(m) <- c("lon","lat")
  m <- as.data.frame( m )
  
  #Adding lat and lon values to raw results and writing the ResultsNPO[i].csv file
  res <- cbind( res, m )
  write.csv( res, paste0("Results/ResultsNPO",i,".csv"), row.names=F )
  
  } # end of loop

# setting back wd
setwd(wd)
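Because requests to a web service can fail intermittently, a query can be retried a few times before giving up. The sketch below is a generic retry wrapper, not part of the original script: with_retry and flaky are hypothetical names, and flaky only simulates a request that fails twice before succeeding.

```r
# Hypothetical helper: call fun() up to max_tries times, returning NULL if all attempts fail
with_retry <- function(fun, max_tries = 3, wait = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(fun(), error = function(e) NULL)
    if (!is.null(result)) return(result)
    Sys.sleep(wait)  # pause before the next attempt
  }
  NULL
}

# Simulated request that fails on the first two calls
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("simulated network error")
  "ok"
}

out <- with_retry(flaky, max_tries = 5)
print(out)
```

In the geocoding loop, fun would be a function wrapping the POST call, and a NULL result would mark the batch as failed.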

3.3 Combining results Files

This code chunk compiles all results and saves them into NPOAddresses_censusGEO.rds.

# setting wd
wd2 <- paste0(wd, "/Data/3_GeoCensus/addresses_npo/2014-2019/Results")
setwd(wd2)

#capturing filenames of all elements in dir() that have "ResultsNPO" (the pattern is case sensitive)
x <- grepl("ResultsNPO", dir()) 
these <- (dir())[x]

#loading first file in the string vector
npo <- read.csv( these[1], stringsAsFactors=F )

#compiling all Results into one
for( i in 2:length(these) )
  {
  d <- read.csv( these[i], stringsAsFactors=F )
  npo <- bind_rows( npo, d )
  }

#saving compiled geocodes
saveRDS( npo, "../../../NPOAddresses_censusGEO.rds" )
setwd(wd)
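A compact alternative for stacking the result files is to read them all with lapply() and bind them in one step. The sketch below writes two tiny made-up result files to a temporary folder; it assumes all files share the same columns, and it reads every column as character so that leading zeros in FIPS codes survive.

```r
# Two toy result files in a temporary folder (hypothetical content)
out_dir <- file.path(tempdir(), "combine_demo")
dir.create(out_dir, showWarnings = FALSE)
write.csv(data.frame(id = "1", county_fips = "011"),
          file.path(out_dir, "ResultsNPO1.csv"), row.names = FALSE)
write.csv(data.frame(id = "2", county_fips = "031"),
          file.path(out_dir, "ResultsNPO2.csv"), row.names = FALSE)

# List the files by pattern, read them all, and stack them
files    <- list.files(out_dir, pattern = "^ResultsNPO[0-9]+\\.csv$", full.names = TRUE)
pieces   <- lapply(files, read.csv, colClasses = "character")
combined <- do.call(rbind, pieces)
print(combined)
```

With this approach lon and lat are also read as character, so they would need as.numeric() before plotting.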

3.4 Integrating results to main NPO address file

Loading files to integrate

# results
npo <- readRDS("Data/3_GeoCensus/NPOAddresses_censusGEO.rds")

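#removing the ID column (and pob, if present) by name, keeping match and the other geocode variables
npo <- npo[ , !(names(npo) %in% c("id","ID","pob")) ]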

# main
npo.main <- readRDS("Data/2_InputData/NONPROFITS-2014-2019v2.rds")

Joining to NPO file

npo.main <- left_join(npo.main, npo, by = "input_address")
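A join by address only preserves the row count of the main file if each address appears once in the geocode results. The sketch below checks this on toy data; main_toy and geo_toy are hypothetical, and base R merge() stands in for left_join() to keep the example free of package dependencies.

```r
# Toy main file: two organizations share one address
main_toy <- data.frame(
  org = c("A", "B", "C"),
  input_address = c("1 ELM ST, X, ME, 04000",
                    "1 ELM ST, X, ME, 04000",
                    "9 OAK RD, Y, ME, 04001"),
  stringsAsFactors = FALSE
)

# Toy geocode results: one row per unique address
geo_toy <- data.frame(
  input_address = c("1 ELM ST, X, ME, 04000", "9 OAK RD, Y, ME, 04001"),
  match = c("Match", "No_Match"),
  stringsAsFactors = FALSE
)

# The lookup table must have one row per key, otherwise the join adds rows
stopifnot(anyDuplicated(geo_toy$input_address) == 0)

joined <- merge(main_toy, geo_toy, by = "input_address", all.x = TRUE, sort = FALSE)
print(joined)
```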

Adding a geocode_type variable to all Match cases

npo.main$geocode_type <- NA

# Adding a value to all Match values yielded in the process
x <- which(npo.main$match %in% "Match")
npo.main$geocode_type[x] <- "census"
pander(table(npo.main$geocode_type, useNA = "ifany"))

census	NA
182042	81230

Renaming the lat lon vars to make sure we know they come from the census

x <- which(names(npo.main) %in% "lon") 
names(npo.main)[x] <- "lon_cen"

x <- which(names(npo.main) %in% "lat") 
names(npo.main)[x] <- "lat_cen"

x <- which(names(npo.main) %in% "lat_lon") 
names(npo.main)[x] <- "lat_lon_cen"

Saving the new version of the npo.main file

saveRDS(npo.main, "Data/3_GeoCensus/NONPROFITS-2014-2019v3.rds")

3.5 Exploring NPO Geocode Results

Let's take a look at the geolocations of our nonprofits:

# loading the file if needed
# npo.main <- readRDS("Data/3_GeoCensus/NONPROFITS-2014-2019v3.rds") 

plot( npo.main$lon_cen, npo.main$lat_cen, pch=19, cex=0.5, col=gray(0.5,0.01))

Summary of geocode (all):

There are 263272 NPOs listed, with 256527 unique addresses.

#Summary
x <- table(npo.main$match, useNA = "always")
y <- prop.table(table(npo.main$match, useNA = "always"))

summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary$percent <- summary$percent*100
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 100)
rownames(summary)[nrow(summary)] <- "TOTAL"
pander(summary)

	frequency	percent
Match	182042	69.15
No_Match	62215	23.63
Tie	3380	1.284
NA.	15635	5.939
TOTAL	263272	100
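Since the same frequency and percent table is built several times in this script, the steps could be wrapped in a small function. The sketch below is one way to do that; match_summary is a hypothetical helper and the input vector is made up.

```r
# Hypothetical helper: frequency and percent of each match outcome, plus a TOTAL row
match_summary <- function(x) {
  freq <- table(x, useNA = "always")
  out <- data.frame(
    frequency = as.vector(freq),
    percent   = as.vector(prop.table(freq)) * 100,
    row.names = ifelse(is.na(names(freq)), "NA", names(freq))
  )
  out[nrow(out) + 1, ] <- c(sum(out$frequency), 100)
  rownames(out)[nrow(out)] <- "TOTAL"
  out
}

# Toy outcomes: two matches, one no-match, one record that was never geocoded
s <- match_summary(c("Match", "Match", "No_Match", NA))
print(s)
```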

Percentage of addresses that are PO Boxes (POBs):

x <- round(prop.table(table(npo.main$pob, useNA = "ifany"))*100,1)
names(x) <- c("Non-POB", "POB")
pander(x)

Non-POB	POB
87.6	12.4

Summary of geocode excluding POBs:

x <- which(npo.main$pob == 0)
npo.main1 <- npo.main[x,]

#Summary
x <- table(npo.main1$match, useNA = "always")
y <- prop.table(table(npo.main1$match, useNA = "always"))

summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary$percent <- summary$percent*100
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 100)
rownames(summary)[nrow(summary)] <- "TOTAL"
pander(summary)

	frequency	percent
Match	180999	78.5
No_Match	32866	14.25
Tie	3302	1.432
NA.	13409	5.815
TOTAL	230576	100

Summary of geocode for only POBs:

x <- which(npo.main$pob == 1)
npo.main2 <- npo.main[x,]

#Summary
x <- table(npo.main2$match, useNA = "always")
y <- prop.table(table(npo.main2$match, useNA = "always"))

summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary$percent <- summary$percent*100
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 100)
rownames(summary)[nrow(summary)] <- "TOTAL"
pander(summary)

	frequency	percent
Match	1043	3.19
No_Match	29349	89.76
Tie	78	0.2386
NA.	2226	6.808
TOTAL	32696	100
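The POB and non-POB match rates can also be compared in a single cross-tabulation with row percentages. The sketch below uses a made-up data frame (toy is hypothetical) with the same pob and match fields.

```r
# Toy data: three non-POB addresses (pob = 0) and two POB addresses (pob = 1)
toy <- data.frame(
  pob   = c(0, 0, 0, 1, 1),
  match = c("Match", "Match", "No_Match", "No_Match", "Match"),
  stringsAsFactors = FALSE
)

# Counts by pob and match, then percentages within each pob group (rows sum to 100)
tab <- table(toy$pob, toy$match, dnn = c("pob", "match"))
pct <- round(prop.table(tab, margin = 1) * 100, 1)
print(pct)
```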

4. Geocoding Board Member (PPL) addresses

Loading the file PPLAddresses_census.rds

ppl <- readRDS( "Data/3_GeoCensus/PPLAddresses_census.rds" )

4.1 Dividing the addresses into batches of 500 each

# Create the folders to hold geocoding results if needed:
# dir.create( "Data/3_GeoCensus/addresses_ppl")
# dir.create( "Data/3_GeoCensus/addresses_ppl/2014-2019")

#setting wd 
wd2 <- paste0(wd, "/Data/3_GeoCensus/addresses_ppl/2014-2019")
setwd(wd2)

# Selecting only essential variables
ppl <- select( ppl, ID, Address, City, State, Zip )

# we remove colnames because the input file should not have a header
names( ppl ) <- NULL  

# Splitting the addresses into files with 500 addresses each
loops <- ceiling( nrow( ppl ) / 500 ) # ceiling() rounds up to the next integer, so loops is the number of batches of 500, counting a final partial batch.

# loop to write the addresses in batches of 500
for( i in 1:loops )
  {
  filename <- paste0( "AddressPPL",i,".csv" )
  start.row <- ((i-1)*500+1) # i starts at 1 and this outputs: 1, 501, 1001, etc.
  end.row <- (500*i) # this outputs 500, 1000, etc.
  if( nrow(ppl) < end.row ){ end.row <- nrow(ppl) } # the last batch ends at the last row

  # writing the batch to a csv address file, without a header row
  write.table( ppl[ start.row:end.row, ], filename, sep=",", row.names=F, col.names=F, qmethod="double" )

  # output to help keep track of the loop.
  print( i )
  print( paste( "Start Row:", start.row ) )
  print( paste( "End Row:", end.row ) )
} # end of loop.

setwd(wd)
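The batch-writing step is the same for NPO and PPL addresses, so it could be expressed once as a function. The sketch below is an illustration under assumptions: write_batches is a hypothetical helper, the toy addresses are made up, and files are written to a temporary folder without a header row.

```r
# Hypothetical helper: write df to header-less CSV files of batch_size rows each
write_batches <- function(df, out_dir, prefix, batch_size = 500) {
  loops <- ceiling(nrow(df) / batch_size)
  files <- character(loops)
  for (i in seq_len(loops)) {
    rows <- ((i - 1) * batch_size + 1):min(i * batch_size, nrow(df))
    files[i] <- file.path(out_dir, paste0(prefix, i, ".csv"))
    write.table(df[rows, ], files[i], sep = ",", row.names = FALSE,
                col.names = FALSE, qmethod = "double")
  }
  files
}

# Toy address table with five rows, written in batches of two
toy <- data.frame(ID = 1:5, Address = paste(1:5, "MAIN ST"), City = "AUGUSTA",
                  State = "ME", Zip = "04330", stringsAsFactors = FALSE)
out_dir <- file.path(tempdir(), "batch_demo")
dir.create(out_dir, showWarnings = FALSE)

files <- write_batches(toy, out_dir, prefix = "AddressPPL", batch_size = 2)
print(basename(files))
```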

4.2 Geocoding PPL addresses

The following code chunk passes the address files produced in the previous step through the Census geocoding service. For each address file it outputs:

Results[i].csv, with the raw geocode results, and
ResultsPPL[i].csv, which has the same data plus separate lon and lat fields.

In addition, a log file, Geocode_Log_[timestamp].txt, is created in the Results folder for the whole process.

Note: Geocoding this many addresses takes several hours. From our experience, geocoding 1000 addresses took XXXX. PENDING: add what to do if the process halts in the middle.

# create a folder for ppl census geocoding files if needed:
# dir.create( "Data/3_GeoCensus/addresses_ppl/2014-2019/Results" )

# setting wd
wd2 <- paste0(wd, "/Data/3_GeoCensus/addresses_ppl/2014-2019")
setwd(wd2)

# producing LOG file
log <- c("Query_Number", "Start_time", "Time_taken")
log <- paste(log, collapse=',')
log.name <- as.character(Sys.time())
log.name <- gsub(":","-",log.name)
log.name <- gsub(" ","-",log.name)
log.name <- paste0("Results/Geocode_Log_",log.name,".txt")
write(log, file=log.name, append = F)

#Geocoding loop:
for( i in 1:loops )
  { 
  #creating the objects that will be used in each iteration of the loop: file name of addresses and api
  addressFile <- paste0( "AddressPPL",i,".csv" ) 
  apiurl <- "https://geocoding.geo.census.gov/geocoder/geographies/addressbatch"

  #outputs in console to track loop
  print( i )
  print(Sys.time() )
  start_time <- Sys.time()
  
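  #Geocode query for i. The query is wrapped in try() so that a failed request does not stop the loop
  resp <- NULL # reset so that a failed query cannot reuse the response from the previous batch
  try( 
    resp <- POST( apiurl, 
                  body=list(addressFile=upload_file(addressFile),
                            benchmark="Public_AR_Census2010",
                            vintage="Census2010_Census2010",
                            returntype="csv" ), 
                  encode="multipart" )
    )

  #documenting ending times
  end_time <- Sys.time()
  print( end_time - start_time ) #outputting in R console

  #writing a line in the log file after query i ends (time taken is recorded in seconds)
  query <- as.character(i)
  len <- as.character( as.numeric( difftime( end_time, start_time, units="secs" ) ) )
  start_time <- as.character(start_time)
  log <- c(query, start_time, len)
  log <- paste(log, collapse=',')
  write(log, file=log.name, append = T)

  #if the query failed, skip this batch so that it can be rerun later
  if( is.null(resp) || http_error(resp) )
    {
    print( paste( "Query failed for batch", i ) )
    next
    }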

  #constructing the Results[i].csv filename.
  addressFile2 <- paste0( "Results/Results",i,".csv" ) 

  #creating column names to include in the results csv file
  var_names <- c( "id", "input_address",
                  "match", "match_type", 
                  "out_address", "lat_lon", 
                  "tiger_line_id", "tiger_line_side", 
                  "state_fips", "county_fips", 
                 "tract_fips", "block_fips" )
  v.names <- paste(var_names, collapse=',')
  
  #writing Results[i].csv, including headers
  writeLines( text=c(v.names, content(resp)) , con=addressFile2 )

  #reading file
  res <- read.csv( addressFile2, header=T, 
                 stringsAsFactors=F, 
                 colClasses="character" )

  # Splitting latitude and longitude values from results (res) to a variable (lat.lon)
  lat.lon <- strsplit( res$lat_lon, "," )
  
  #adding NAs to lat.lon empty fields
  for( j in 1:length(lat.lon) )
    {
    if( length( lat.lon[[j]] ) < 2 )
      lat.lon[[ j ]] <- c(NA,NA)
    }
  
  #transforming the split lat/lon values into columns that can be bound to a data frame
  m <- matrix( unlist( lat.lon ), ncol=2, byrow=T )
  colnames(m) <- c("lon","lat")
  m <- as.data.frame( m )
  
  #Adding lat and lon values to raw results and writing ResultsPPL[i].csv file
  res <- cbind( res, m )
  write.csv( res, paste0("Results/ResultsPPL",i,".csv"), row.names=F )
  
  } # end of loop

# setting back wd
setwd(wd)

4.3 Combining Result Files

This code chunk compiles all results and saves them into PPLAddresses_censusGEO.rds.

# setting wd
wd2 <- paste0(wd, "/Data/3_GeoCensus/addresses_ppl/2014-2019/Results")
setwd(wd2)

#capturing filenames of all elements in dir() that have "ResultsPPL" (the pattern is case sensitive)
x <- grepl("ResultsPPL", dir()) 
these <- (dir())[x]

#loading first file in the string vector
ppl <- read.csv( these[1], stringsAsFactors=F )

#compiling all Results into one
for( i in 2:length(these) )
  {
  d <- read.csv( these[i], stringsAsFactors=F )
  ppl <- bind_rows( ppl, d )
  }

#saving compiled geocodes
saveRDS( ppl, "../../../PPLAddresses_censusGEO.rds" ) # three levels up from Results is Data/3_GeoCensus, where the next step reads the file
setwd(wd)

4.4 Integrating results to PPL main file

Loading files to integrate

# results
ppl <- readRDS("Data/3_GeoCensus/PPLAddresses_censusGEO.rds")
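#removing the ID column (and pob, if present) by name, keeping match and the other geocode variables
ppl <- ppl[ , !(names(ppl) %in% c("id","ID","pob")) ]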

# main
ppl.main <- readRDS("Data/2_InputData/PEOPLE-2014-2019v2.rds")

Joining files

ppl.main <- left_join(ppl.main, ppl, by = "input_address")

Adding a geocode_type variable to all Match results

ppl.main$geocode_type <- NA

# Adding a value to all Matches yielded in the process
x <- which(ppl.main$match %in% "Match")
ppl.main$geocode_type[x] <- "census"
pander(table(ppl.main$geocode_type, useNA = "ifany"))

census	NA
689983	256110

Renaming the lat lon vars to make sure we know they come from the census

x <- which(names(ppl.main) %in% "lon") 
names(ppl.main)[x] <- "lon_cen"

x <- which(names(ppl.main) %in% "lat") 
names(ppl.main)[x] <- "lat_cen"

x <- which(names(ppl.main) %in% "lat_lon") 
names(ppl.main)[x] <- "lat_lon_cen"

Saving the new version of the ppl.main file

saveRDS(ppl.main, "Data/3_GeoCensus/PEOPLE-2014-2019v3.rds")

4.5 Exploring PPL Geocode Results

Let's take a look at the geolocations of our board members:

# read in file if necessary
# ppl.main <- readRDS("Data/3_GeoCensus/PEOPLE-2014-2019v3.rds")

# plotting data
plot( ppl.main$lon_cen, ppl.main$lat_cen, pch=19, cex=0.5, col=gray(0.5,0.01))

Summary of geocoding process

There are 946093 PPL records listed, with 729304 unique addresses.
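Counts like these can be obtained by comparing the number of records with the number of distinct input_address values. A minimal sketch on made-up addresses (addr is hypothetical):

```r
# Toy address vector: three records, two of them at the same address
addr <- c("1 ELM ST, X, ME, 04000",
          "1 ELM ST, X, ME, 04000",
          "9 OAK RD, Y, ME, 04001")

n_records <- length(addr)          # all listed records
n_unique  <- length(unique(addr))  # distinct addresses

print(paste("There are", n_records, "records listed, with", n_unique, "unique addresses"))
```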

#Summary
x <- table(ppl.main$match, useNA = "always")
y <- prop.table(table(ppl.main$match, useNA = "always"))

summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary$percent <- summary$percent*100
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 100)
rownames(summary)[nrow(summary)] <- "TOTAL"
pander(summary)

	frequency	percent
Match	689983	72.93
No_Match	183299	19.37
Tie	11591	1.225
NA.	61220	6.471
TOTAL	946093	100