PENDING
In this script we use the Google geocoding service to geocode the addresses that the Census service could not geocode. The resulting dataset will be saved as a new version of the main files: NONPROFITS-2014-2019v4.rds and PEOPLE-2014-2019v4.rds.
STEPS
NOTES
PACKAGES
# loading main files
npo.main <- readRDS("Data/3_GeoCensus/NONPROFITS-2014-2019v3.rds")
ppl.main <- readRDS("Data/3_GeoCensus/PEOPLE-2014-2019v3.rds")
Addresses that failed Census geocoding, excluding PO boxes (POBs), will be saved into the files NPOAddresses_google.rds and PPLAddresses_google.rds so they can be passed through the Google geocoding service.
The Google geocoding process needs only the input_address and ID variables.
Subsetting NPO addresses for google geocoding
# will subset NA, Tie and No_match values
x <- which(npo.main$match %in% c("No_Match", "Tie", NA))
npo <- npo.main[x,c(1,71,73)]
# removing POBs
x <- which(npo$pob == 0)
npo <- npo[x,]
# resetting ID and removing duplicate addresses
npo$ID <- 0
npo <- unique(npo)
npo <- npo[order(npo$input_address),]
npo$ID <- 1:nrow(npo)
rownames(npo) <- NULL
saveRDS(npo, "Data/4_GeoGoogle/NPOAddresses_google.rds")
Subsetting PPL addresses for google geocoding
# will subset NA, Tie and No_match values
x <- which(ppl.main$match %in% c("No_Match", "Tie", NA))
ppl <- ppl.main[x,c(1,19, 21)]
# removing POBs
x <- which(ppl$pob == 0)
ppl <- ppl[x,]
# Resetting ID var and removing duplicate addresses
ppl$ID <- 0
ppl <- unique(ppl)
ppl <- ppl[order(ppl$input_address),]
ppl$ID <- 1:nrow(ppl)
rownames(ppl) <- NULL
saveRDS(ppl, "Data/4_GeoGoogle/PPLAddresses_google.rds")
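The NPO and PPL blocks above follow the same steps and select columns by position (c(1,71,73) and c(1,19,21)), which can silently pick the wrong columns if the column order of the main files changes. Below is a minimal sketch of the same logic with columns selected by name. It assumes the columns are called ID, pob, input_address and match; the helper name prep_for_google and the toy data are hypothetical and not part of the original script.

```r
# Hypothetical helper: same steps as the NPO and PPL blocks above, but with
# columns selected by name. Assumes columns ID, pob, input_address, match.
prep_for_google <- function(main) {
  # keep rows the Census geocoder failed on (No_Match, Tie, or NA)
  failed <- main$match %in% c("No_Match", "Tie", NA)
  out <- main[failed, c("ID", "pob", "input_address")]
  # drop PO boxes
  out <- out[which(out$pob == 0), ]
  # reset ID so that duplicated addresses collapse into a single row
  out$ID <- 0
  out <- unique(out)
  out <- out[order(out$input_address), ]
  out$ID <- seq_len(nrow(out))
  rownames(out) <- NULL
  out
}

# toy data for illustration only
toy <- data.frame(
  ID = 1:5,
  pob = c(0, 0, 1, 0, 0),
  input_address = c("B ST", "A ST", "PO BOX 1", "A ST", "C ST"),
  match = c("No_Match", "Tie", NA, NA, "Match"),
  stringsAsFactors = FALSE
)
toy_out <- prep_for_google(toy)
toy_out
```

Using seq_len(nrow(out)) instead of 1:nrow(out) also avoids an error in the edge case where no addresses are left to geocode.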
Sources:
• https://lucidmanager.org/geocoding-with-ggmap/
• https://www.wpgmaps.com/documentation/troubleshooting/this-api-project-is-not-authorized-to-use-this-api/
• https://www.rdocumentation.org/packages/ggmap/versions/3.0.0/topics/geocode
The Google API takes a character vector of street addresses or place names (e.g. “1600 pennsylvania avenue, washington dc” or “Baylor University”) and returns lat and lon coordinates.
There should be no cost, because we stay within the free geocoding quota, but you will still need to enter your credit card information. At the time of writing, Google allowed 40,000 free calls a month. TODO: note where this can be checked in case the policy changes.
Steps to set up a google API:
Using the API Library enable the following three APIs:
Google recommends placing restrictions on your account to limit the possibility of being charged.
Testing the google service on a small sample…
setwd(wd)
ppl <- readRDS("Data/4_GeoGoogle/PPLAddresses_google.rds")
# 1. selecting a random sample of 5 addresses to test the geocode
x <- ppl$ID
x <- sample(x, 5, replace = FALSE, prob = NULL)
smpl <- ppl[x,]
api <- readLines("../google1.api") # reading my personal API key from a local file
register_google(key = api) #The register_google function stores the API key.
getOption("ggmap") #summarises the Google credentials to check how you are connected.
dat <- mutate_geocode(smpl, input_address, output = "latlona", source = "google", messaging = TRUE) # generates an object where the original dataset is bound to the geocode results.
saveRDS(dat, "Data/4_GeoGoogle/DemoResults.rds")
The output of the google geocode looks like this:
ID | pob | input_address | lon | lat | address |
---|---|---|---|---|---|
54520 | 0 | 2072 CARMEL RD NORTH, NEWBURGH, ME, 04444 | -68.96 | 44.73 | 2072 carmel rd n, newburgh, me 04444, usa |
177709 | 1 | PO BOX 358, ELFRIDA, AZ, 85610 | -109.7 | 31.69 | elfrida, az 85610, usa |
18991 | 0 | 1217 NW ROLLING ROCK RD, ANKENY, IA, 50023 | -93.64 | 41.74 | 1217 nw rolling rock rd, ankeny, ia 50023, usa |
60118 | 0 | 2222 HOME PARK CIR W, JACKSONVILLE, FL, 32207 | -81.63 | 30.3 | 2222 w home park cir, jacksonville, fl 32207, usa |
148525 | 0 | 8783 MONCOVE LAKE RD, GAP MILLS, WV, 24941 | -80.32 | 37.66 | 8783 moncove lake rd, gap mills, wv 24941, usa |
Loading file
# setting wd
wd2 <- paste0(wd, "/Data/4_GeoGoogle")
setwd(wd2)
# Loading addresses to geocode
npo <- readRDS("NPOAddresses_google.rds")
We have 48206 addresses to geocode. Given the 40,000 free calls a month, this will have to be divided into two batches.
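The script does not show how the batch objects (such as npo1) are created. One possible way to split the addresses is sketched below, assuming a batch size equal to the 40,000 free monthly calls mentioned earlier. The helper name split_batches and the toy data frame are hypothetical.

```r
# Sketch: split a data frame of addresses into batches of at most
# `batch_size` rows. The default of 40000 is the monthly free quota
# quoted in this document; adjust it if the quota differs.
split_batches <- function(df, batch_size = 40000) {
  if (nrow(df) == 0) return(list())
  batch_id <- ceiling(seq_len(nrow(df)) / batch_size)
  split(df, batch_id)
}

# toy stand-in with the same number of rows as the NPO address file
toy <- data.frame(ID = seq_len(48206))
batches <- split_batches(toy)
length(batches)          # number of batches
sapply(batches, nrow)    # rows per batch
```

With the real data, batches[[1]] would play the role of npo1. Applied to the 161067 PPL addresses, the same helper yields five batches.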
The following code chunk will run the first batch npo1:
# loading API
api <- readLines("../../../google1.api")
# Geocoding through google. This will generate an object where the original dataset is bound to the geocode results.
register_google(key = api) #The register_google function stores the API key.
getOption("ggmap") #summarises the Google credentials to check how you are connected.
npo1.res <- mutate_geocode(npo1, input_address, output = "latlona", source = "google", messaging = TRUE)
#saving results
saveRDS(npo1.res, "Results/npo1res.rds") # change the name of the file accordingly
Loading files to integrate
setwd(wd)
# loading results file
npo.res <- readRDS("Data/4_GeoGoogle/Results/NPOAddresses_googleGEO.rds")
npo.res <- npo.res[,-c(1,2)]
names(npo.res) <- c("input_address", "lon_ggl", "lat_ggl", "address_ggl")
# loading main
npo.main <- readRDS("Data/3_GeoCensus/NONPROFITS-2014-2019v3.rds")
Joining files
Adding a geocode_type variable to all Addresses geocoded by google
Saving the new version of the npo.main file
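The code for these three steps is not shown here. The sketch below illustrates, on toy data, what such a join could look like. It assumes that the results are merged onto the main file by input_address, that geocode_type already holds "census" for Census-geocoded rows, and that Google-geocoded rows are labelled "google". None of these assumptions is confirmed by the original script.

```r
# toy stand-ins for npo.main and npo.res (illustration only)
main <- data.frame(
  ID = 1:4,
  input_address = c("A ST", "B ST", "C ST", "D ST"),
  geocode_type = c("census", NA, NA, "census"),
  stringsAsFactors = FALSE
)
res <- data.frame(
  input_address = c("B ST", "C ST"),
  lon_ggl = c(-93.64, NA),        # NA: Google returned no coordinates
  lat_ggl = c(41.74, NA),
  address_ggl = c("b st, usa", NA),
  stringsAsFactors = FALSE
)

# left join: keep every row of main, add Google columns where available
joined <- merge(main, res, by = "input_address", all.x = TRUE)
joined <- joined[order(joined$ID), ]
rownames(joined) <- NULL

# label rows that Google geocoded successfully (assumed label: "google")
got_google <- is.na(joined$geocode_type) & !is.na(joined$lon_ggl)
joined$geocode_type[got_google] <- "google"
joined
```

A join like this relies on input_address being unique in the results table; duplicated addresses there would duplicate rows in the main file.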
Let's take a look at the geolocations of our Board Members:
Summary of geocoding process
There are 263272 NPOs listed, with 256527 unique addresses.
#Summary
x <- table(npo.main$geocode_type, useNA = "always")
y <- prop.table(table(npo.main$geocode_type, useNA = "always"))
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary$percent <- summary$percent*100
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 100)
rownames(summary)[nrow(summary)] <- "TOTAL"
pander(summary)
geocode_type | frequency | percent |
---|---|---|
census | 182042 | 69.15 |
google | 37794 | 14.36 |
NA. | 43436 | 16.5 |
TOTAL | 263272 | 100 |
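The same frequency and percent summary is repeated for several subsets in this script. As a sketch, the steps could be wrapped in a small function so that each summary becomes a single call. The name geocode_summary is hypothetical, and the toy vector is for illustration only.

```r
# Hypothetical helper wrapping the frequency/percent summary used in this script
geocode_summary <- function(x) {
  freq <- table(x, useNA = "always")
  labels <- names(freq)
  labels[is.na(labels)] <- "NA"          # label the missing-value row
  out <- data.frame(
    frequency = as.vector(freq),
    percent = as.vector(prop.table(freq)) * 100,
    row.names = labels
  )
  out["TOTAL", ] <- c(sum(out$frequency), 100)
  out
}

# toy vector for illustration only
toy <- c("census", "census", "google", NA)
s <- geocode_summary(toy)
s
```

For example, geocode_summary(npo.main$geocode_type[which(npo.main$pob == 0)]) would give the summary that excludes POBs, and the result can be passed to pander() as before.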
The share of POBs among NPO addresses (in percent) is as follows:
x <- round(prop.table(table(npo.main$pob, useNA = "ifany"))*100,1)
names(x) <- c("Non-POB", "POB")
pander(x)
Non-POB | POB |
---|---|
87.6 | 12.4 |
Summary of geocoding process excluding POBs:
x <- which(npo.main$pob == 0)
npo.main1 <- npo.main[x,]
#Summary
x <- table(npo.main1$geocode_type, useNA = "always")
y <- prop.table(table(npo.main1$geocode_type, useNA = "always"))
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary$percent <- summary$percent*100
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 100)
rownames(summary)[nrow(summary)] <- "TOTAL"
pander(summary)
geocode_type | frequency | percent |
---|---|---|
census | 180999 | 78.5 |
google | 37794 | 16.39 |
NA. | 11783 | 5.11 |
TOTAL | 230576 | 100 |
Summary of geocoding process for only POBs:
x <- which(npo.main$pob == 1)
npo.main2 <- npo.main[x,]
#Summary
x <- table(npo.main2$geocode_type, useNA = "always")
y <- prop.table(table(npo.main2$geocode_type, useNA = "always"))
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary$percent <- summary$percent*100
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 100)
rownames(summary)[nrow(summary)] <- "TOTAL"
pander(summary)
geocode_type | frequency | percent |
---|---|---|
census | 1043 | 3.19 |
NA. | 31653 | 96.81 |
TOTAL | 32696 | 100 |
Loading file
# setting wd
wd2 <- paste0(wd, "/Data/4_GeoGoogle")
setwd(wd2)
# Loading addresses to geocode
ppl <- readRDS("PPLAddresses_google.rds")
We have 161067 addresses to geocode. This will have to be divided into five batches.
The following code chunk will run the first batch ppl1:
# loading API
api <- readLines("../../../google1.api")
# Geocoding through google. This will generate an object where the original dataset is bound to the geocode results.
register_google(key = api) #The register_google function stores the API key.
getOption("ggmap") #summarises the Google credentials to check how you are connected.
ppl1.res <- mutate_geocode(ppl1, input_address, output = "latlona", source = "google", messaging = TRUE)
#saving results
saveRDS(ppl1.res, "Results/ppl1res.rds") # change the name of the file accordingly
Loading files to integrate
setwd(wd)
# loading results file
ppl.res <- readRDS("Data/4_GeoGoogle/Results/PPLAddresses_googleGEO.rds")
ppl.res <- ppl.res[,-c(1,2)]
names(ppl.res) <- c("input_address", "lon_ggl", "lat_ggl", "address_ggl")
# loading main
ppl.main <- readRDS("Data/3_GeoCensus/PEOPLE-2014-2019v3.rds")
Joining files
Adding a geocode_type variable to all Addresses geocoded by google
Saving the new version of the ppl.main file
Summary of geocoding process
There are 946093 PPL listed, with 729304 unique addresses.
#Summary
x <- table(ppl.main$geocode_type, useNA = "always")
y <- prop.table(table(ppl.main$geocode_type, useNA = "always"))
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary$percent <- summary$percent*100
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 100)
rownames(summary)[nrow(summary)] <- "TOTAL"
pander(summary)
geocode_type | frequency | percent |
---|---|---|
census | 689983 | 72.93 |
google | 149739 | 15.83 |
NA. | 106371 | 11.24 |
TOTAL | 946093 | 100 |
The share of POBs among PPL addresses (in percent) is as follows:
x <- round(prop.table(table(ppl.main$pob, useNA = "ifany"))*100,1)
names(x) <- c("Non-POB", "POB")
pander(x)
Non-POB | POB |
---|---|
94.1 | 5.9 |
Summary of geocoding process excluding POBs:
x <- which(ppl.main$pob == 0)
ppl.main1 <- ppl.main[x,]
#Summary
x <- table(ppl.main1$geocode_type, useNA = "always")
y <- prop.table(table(ppl.main1$geocode_type, useNA = "always"))
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary$percent <- summary$percent*100
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 100)
rownames(summary)[nrow(summary)] <- "TOTAL"
pander(summary)
geocode_type | frequency | percent |
---|---|---|
census | 688679 | 77.33 |
google | 149739 | 16.81 |
NA. | 52152 | 5.856 |
TOTAL | 890570 | 100 |
Summary of geocoding process for only POBs:
x <- which(ppl.main$pob == 1)
ppl.main2 <- ppl.main[x,]
#Summary
x <- table(ppl.main2$geocode_type, useNA = "always")
y <- prop.table(table(ppl.main2$geocode_type, useNA = "always"))
summary <- as.data.frame(t(rbind(x,y)))
colnames(summary) <- c("frequency", "percent")
summary$percent <- summary$percent*100
summary[nrow(summary)+1,] <- c(sum(summary$frequency), 100)
rownames(summary)[nrow(summary)] <- "TOTAL"
pander(summary)
geocode_type | frequency | percent |
---|---|---|
census | 1304 | 2.349 |
NA. | 54219 | 97.65 |
TOTAL | 55523 | 100 |
If a geocoding process is aborted before finishing, you might need to run the geocoding again for the addresses that were not processed. The code below helps to compile all geocode results into one file.
ADD examples…
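As one possible illustration, the sketch below compiles several partial result files into one table and identifies the addresses that still have no result. The file names, the pattern "res\\.rds$" and the helper compile_results are assumptions made for this example; the demonstration runs on toy data in a temporary folder.

```r
# Sketch: compile partial result files and find addresses still to geocode.
# File names and folder layout are assumptions for illustration.
compile_results <- function(dir, pattern = "res\\.rds$") {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  if (length(files) == 0) return(NULL)
  parts <- lapply(files, readRDS)
  out <- do.call(rbind, parts)
  # a rerun batch can overlap with an aborted one: keep one row per ID
  out <- out[!duplicated(out$ID), ]
  rownames(out) <- NULL
  out
}

# toy demonstration in a temporary folder
tmp <- file.path(tempdir(), "geo_results_demo")
dir.create(tmp, showWarnings = FALSE)
all_addr <- data.frame(ID = 1:5, input_address = paste(1:5, "MAIN ST"),
                       stringsAsFactors = FALSE)
part1 <- cbind(all_addr[1:2, ], lon = c(-80.1, -80.2), lat = c(37.1, 37.2))
part2 <- cbind(all_addr[2:3, ], lon = c(-80.2, -80.3), lat = c(37.2, 37.3))
saveRDS(part1, file.path(tmp, "npo1res.rds"))
saveRDS(part2, file.path(tmp, "npo2res.rds"))

done <- compile_results(tmp)
# addresses with no result yet: these would form the next batch
remaining <- all_addr[!all_addr$ID %in% done$ID, ]
remaining
```

Note that a row counts as done here even if Google returned no coordinates for it; whether such rows should be sent again is a separate decision.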
What happens if the Census geocoding has to be done multiple times? If the data breaks? How to manage?
• Potential troubleshooting: IDs
• Getting stuck and having to reset
• Blank files returned
• What data checks can we do to make sure the step is final?
• Setting your computer to not sleep or turn the hard drive off.
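As a starting point for the data checks mentioned above, the sketch below compares a results table against the input addresses. It assumes both tables share an ID column and that the results hold lon and lat columns, as in the demo output; the helper name check_results and the toy data are hypothetical.

```r
# Sketch of basic sanity checks on a compiled results table.
# Assumes `input` and `results` share an ID column and that `results`
# has lon and lat columns.
check_results <- function(input, results) {
  list(
    is_blank      = nrow(results) == 0,                   # blank file returned
    missing_ids   = setdiff(input$ID, results$ID),        # never geocoded
    duplicate_ids = unique(results$ID[duplicated(results$ID)]),
    n_no_coords   = sum(is.na(results$lon) | is.na(results$lat))
  )
}

# toy data for illustration only
input <- data.frame(ID = 1:4)
results <- data.frame(
  ID  = c(1L, 2L, 2L, 3L),
  lon = c(-80.3, NA, NA, -81.6),
  lat = c(37.7, NA, NA, 30.3)
)
chk <- check_results(input, results)
chk
```

A step could be treated as final once missing_ids and duplicate_ids are empty and n_no_coords matches the number of addresses known to have failed.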