Name Parsing at Scale

Written by Jesse Lecy
Aug 2, 2020

This name parser helps process large administrative datasets, which often report human names in a single unstructured text field.

The peopleparser package for R converts the raw text of a name into structured fields. It determines name order probabilistically, based on how frequently each name appears on US birth certificates.
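To make the idea of frequency-based name ordering concrete, here is a minimal sketch. It is not the package's implementation: the `name_counts` table, its numbers, and the helpers `p_first_name()` and `guess_order()` are all hypothetical and exist only for illustration.

```r
# Hypothetical counts of how often each token occurs as a first name
# versus a last name. The numbers are invented, not the package's data.
name_counts <- data.frame(
  token      = c("MICHAEL", "LIVINGSTON", "JAMES", "ROWAN"),
  first_freq = c(4300000, 900, 5100000, 12000),
  last_freq  = c(25000, 60000, 120000, 18000),
  stringsAsFactors = FALSE
)

# Share of a token's occurrences that are first names.
# Tokens missing from the table get 0.5 (no information).
p_first_name <- function(token, counts = name_counts) {
  row <- counts[counts$token == toupper(token), ]
  if (nrow(row) == 0) return(0.5)
  row$first_freq / (row$first_freq + row$last_freq)
}

# Order a two-token name so the likelier first name comes first.
guess_order <- function(a, b) {
  if (p_first_name(a) >= p_first_name(b)) {
    c(first = toupper(a), last = toupper(b))
  } else {
    c(first = toupper(b), last = toupper(a))
  }
}

guess_order("livingston", "michael")
```

With these made-up counts, "MICHAEL" is far more common as a first name than "LIVINGSTON", so the two tokens are swapped into first-last order.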

Gender is coded the same way: the first name from the parsed string is looked up in the same name-frequency data, and the more likely gender is assigned.
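Probabilistic gender coding can be sketched in a few lines of base R. This is an illustration under stated assumptions, not the package's code: the `births` table and the `code_gender()` helper are hypothetical, and the counts are invented.

```r
# Hypothetical birth counts by first name and sex (invented numbers).
births <- data.frame(
  first_name = c("KAREN", "MICHAEL", "JORDAN"),
  female     = c(985000, 22000, 130000),
  male       = c(1500, 4300000, 370000),
  stringsAsFactors = FALSE
)

# Assign the more frequent gender and report its share as a percentage.
# Names missing from the table return NA for both fields.
code_gender <- function(first_name, counts = births) {
  row <- counts[counts$first_name == toupper(first_name), ]
  if (nrow(row) == 0) {
    return(data.frame(gender = NA_character_, gender_confidence = NA_real_))
  }
  p_female <- row$female / (row$female + row$male)
  data.frame(
    gender            = if (p_female >= 0.5) "F" else "M",
    gender_confidence = round(100 * max(p_female, 1 - p_female), 1)
  )
}

code_gender("Jordan")
```

In this toy table "JORDAN" is male in 74 percent of births, so it is coded M with a confidence of 74, while a name like "KAREN" would be coded F with a confidence near 100.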

The raw name is returned in parsed form, split into the following components:

SALUTATION
FIRST_NAME
MIDDLE_NAME
LAST_NAME
SUFFIX
GENDER

# the parse.name() function returns a string
x <- 'livingston III,  Mr. MICHAEL JOHN9'
parse.name(x)
# "MR | MICHAEL | JOHN | LIVINGSTON |III | M | 99.6"
 
x <- "THOMAS H VON KAMECKE"
parse.name(x)
# [1] " | THOMAS | H | VON-KAMECKE | | M | 99.8"
 
 
# the parse.names() function (plural) returns a data frame  
x <- get_example_names( n=10 )
parse.names(x)
                    name salutation first_name middle_name  last_name suffix gender gender_confidence
1          Karen H Green                 KAREN           H      GREEN             F               100
2               ED MARKS                    ED                  MARKS             M               100
3     MATTHEW BERSHADKER               MATTHEW             BERSHADKER             M              99.7
4     LINDA L SCHOELKOPF                 LINDA           L SCHOELKOPF             F               100
5            JAMES ROWAN                 JAMES                  ROWAN             M              99.7
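Because parse.name() returns a single pipe-delimited string, it can be handy to split that string into named fields. The sketch below uses only base R and the output string shown in the first example. The field names are an assumption borrowed from the column order of the parse.names() data frame.

```r
# Output string copied from the parse.name() example above.
parsed <- "MR | MICHAEL | JOHN | LIVINGSTON |III | M | 99.6"

# Split on the literal pipe character and trim surrounding whitespace.
parts <- trimws(strsplit(parsed, "|", fixed = TRUE)[[1]])

# Assumed field names, following the parse.names() column order.
names(parts) <- c("salutation", "first_name", "middle_name",
                  "last_name", "suffix", "gender", "gender_confidence")

parts[["last_name"]]
as.numeric(parts[["gender_confidence"]])
```

For more than a handful of names, parse.names() is the simpler route since it already returns a data frame.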

Use With Large Datasets

The parse.names() function uses parallelization to speed up large jobs.

To estimate the time needed for a large number of names, benchmark the parser on a sample and examine the runtime:

x <- get_example_names( n=1000 )
 
start_time <- Sys.time()
pn <- parse.names( x )
end_time <- Sys.time()

end_time - start_time
# Time difference of 8.7648 secs
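A benchmark like this can be scaled up to a rough estimate for a full job. The helper below, `estimate_runtime()`, is a hypothetical convenience function and assumes runtime grows linearly with the number of names, which may not hold exactly once parallelization overhead and machine load are taken into account.

```r
# Extrapolate a benchmark to a larger job.
# Assumes runtime is proportional to the number of names (an assumption,
# not a measured property of the package).
estimate_runtime <- function(sample_secs, sample_n, total_n) {
  secs <- sample_secs / sample_n * total_n
  c(seconds = secs, minutes = secs / 60, hours = secs / 3600)
}

# 8.7648 seconds for 1,000 names, scaled to a hypothetical 1,000,000 names
estimate_runtime(sample_secs = 8.7648, sample_n = 1000, total_n = 1e6)
```

Under that assumption, one million names would take roughly 2.4 hours on the same machine. To feed your own benchmark into the helper, convert the time difference to seconds with as.numeric( difftime( end_time, start_time, units="secs" ) ).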

Installation

The R package can be installed as follows:

devtools::install_github( "Nonprofit-Open-Data-Collective/peopleparser" )
library( peopleparser )
