titleclassifier Package Overview

An R package to assign raw nonprofit executive titles from Form 990 Part VII to a well-structured title taxonomy.

Example Challenges

##  [1] "PROGRAMS COCHAIR, RESEARCH COCHAIR"    
##  [2] "FIN SECY / TREASURER PAST"             
##  [3] "RECORDING SECY PAST"                   
##  [4] "WARDEN PAST"                           
##  [5] "CONDUCTOR PAST"                        
##  [6] "FINANCIAL SECRETARY/TREASU -PAST"      
##  [7] "VICE PRESIDENT - NEW AND PAST"         
##  [8] "TREASURER - NEW AND PAST"              
##  [9] "WARDEN - NEW AND PAST"                 
## [10] "FINANCIAL SEC'Y - PAST"                
## [11] "VICE - PRES - PAST"                    
## [12] "FIN SEC'Y/ TREASURER - PAST"           
## [13] "BOARD CHAIR, FINANCE COMMITTEE"        
## [14] "TREASURER, DIRECTOR, FINANCE COMMITTEE"
## [15] "FIN'L SECRETARY/TREASURER - PAST"      
## [16] "LOWER DIV COORDINAT"                   
## [17] "UPPER DIV COORDINATE"                  
## [18] "SCOREMASTER"                           
## [19] "SNACK BAR COORDINAT"                   
## [20] "SPONSOR/FUND COORDIN"

Demo of titleclassifier Package

Installation

devtools::install_github( 'nonprofit-open-data-collective/titleclassifier' )

After installation complete without errors load the package.

library( titleclassifier )    # title classifier package 
library( dplyr )              # data wrangling
library( pander )             # nicely formatted output

Demo Data

The package includes some IRS 990 Part VII data for development and demonstration purposes.

data( tinypartvii ) # 10,000 rows of part vii data
head( tinypartvii )

Data Dictionary:

  • OBJECT_ID - id of the xml object storing return data
  • URL - online location of the raw XML form from which data was derived
  • EIN - employee identification number
  • NAME - organization’s name
  • TAXYR - tax year of the data
  • FORMTYPE - type of IRS 990 form completed (990 or 990EZ)
  • F9_07_COMP_DTK_NAME_PERS - director, trustee or key employee name
  • F9_07_COMP_DTK_TITLE - title
  • F9_07_COMP_DTK_AVE_HOUR_WEEK - hours worked per week over tax period
  • F9_07_COMP_DTK_AVE_HOUR_WEEK_RL - hours worked for a related entity
  • F9_07_COMP_DTK_COMP_ORG - direct compensation from the nonprofit (salary or wages)
  • F9_07_COMP_DTK_COMP_RLTD - compensated received from a related entity
  • F9_07_COMP_DTK_COMP_OTH - other compensation amount received (bonus, fringe)
  • F9_07_COMP_DTK_EMPL_BEN - compensation amount in benefits
  • F9_07_COMP_DTK_POS_FORMER_X - checkbox designating former status
  • F9_07_COMP_DTK_POS_HIGH_COMP_X - highly-compensated individual checkbox
  • F9_07_COMP_DTK_POS_OFF_X - officer checkbox
  • F9_07_COMP_DTK_POS_INDIV_TRUST_X - individual trustee checkbox
  • F9_07_COMP_DTK_POS_KEY_EMPL_X - key employee checkbox
  • F9_07_COMP_DTK_POS_INST_TRUST_X - institutional trustee checkbox

Necessary Data

The following internal datasets are used by functions within the package for mapping and disambiguation purposes:

Disambiguation Lists:

  • likely.subjects: used in fix_of()
  • likely.titles: used in ???()
  • possible.titles: used in ???()

Status Words:

Titles are often accompanied by some adjective describing a temporary or temporal period in which the person holds the title. These phrases are removed from the titles and flags are created for each.

These are used in the gen_status_codes() step:

https://github.com/Nonprofit-Open-Data-Collective/titleclassifier/blob/main/R/06-gen-status-codes.R

  • former.words
  • future.words
  • interim.words
  • regional.words
  • possible.regional-list

Demo Data:

Sample datasets for development and testing purposes:

  • tinypartvii
  • rawtitles

The package includes additional data that is used by functions for cleaning and assignment processes.

NOTE: We need to export these:

data( former.words  ) %>% head() 

These data are imported into the package via Google Sheets:

[reference google sheet cases]

Title Cleaning Process

There are approximately 100 meaningful titles in a sparse taxonomy of key leadership positions disclosed on the Directors, Trustees, and Key Employees section of the 990 forms. The title field appears as raw text, however, so there are no constraints on what nonprofits can write in the field. As a result, there are over 900,000 unique strings in the database.

# create sample of 100 observations for testing
set.seed( 1234 )
df.sample <- sample_n( titleclassifier::tinypartvii, 100 )

df <- 
  df.sample %>% 
  standardize_df() %>% 
  remove_dates() %>% 
  standardize_conj() %>%   # condition has length > 1
  split_titles() %>% 
  standardize_spelling() %>%   # condition has length > 1
  gen_status_codes() %>% 
  standardize_titles() %>%
  categorize_titles() %>%
  conditional_logic()
## [1] "standardize df step complete"
## [1] "remove dates step complete"
## [1] "standardize conjunctions step complete"
## [1] "split titles step complete"
## [1] "standardize spelling step complete"
## [1] "generate status codes step complete"
## [1] "standardize titles step complete"
## [1] "categorize titles step complete"
## [1] "conditional logic step complete"
df <- reorder_df( df )
df

Problem cases:

Why is this one split?

# df[ df$F9_07_COMP_DTK_NAME_PERS == "Pat L Wheeler" , ]


# These need to be unique otherwise duplicates columns 
# PRESIDENT --> CEO
# PRESIDENT --> BOARD PRESIDENT 

googlesheets4::gs4_deauth()
df.standard <- googlesheets4::read_sheet( "1iYEY2HYDZTV0uvu35UuwdgAUQNKXSyab260pPPutP1M", 
                                          sheet="title-standardization", range="A:B",
                                          col_types = "c" )  # c = character
  
length( df.standard$title.variant )
## [1] 1179
length( unique( df.standard$title.variant ))
## [1] 1142
nrow( df.standard )
## [1] 1179
# [1] 713
df.standard <- unique( df.standard )
nrow( df.standard )
## [1] 1158
# [1] 696

# PROBLEM CASES
df.standard %>% 
  group_by( title.variant ) %>% 
  filter( n() > 1 ) %>% 
  arrange( title.variant ) %>% 
  pander()
title.variant title.standard
ASSISTANT DIRECTOR ASSISTANT DIRECTOR
ASSISTANT DIRECTOR BOARD MEMBER
ASSOCIATE DIRECTOR ASSOCIATE DIRECTOR
ASSOCIATE DIRECTOR BOARD MEMBER
CHANCELLOR BOARD PRESIDENT
CHANCELLOR CEO
CHIEF OF OPERATIONS AND FINANCE CFO
CHIEF OF OPERATIONS AND FINANCE COO
DIRECTOR OF OPERATIONS COO
DIRECTOR OF OPERATIONS DIRECTOR OF OPERATIONS
EDUCATION DIRECTOR BOARD CHAIR
EDUCATION DIRECTOR DIRECTOR OF EDUCATION
FINANCE DIRECTOR BOARD CHAIR
FINANCE DIRECTOR DIRECTOR OF FINANCE
FINANCE OFFICER BOARD TREASURER
FINANCE OFFICER FINANCE OFFICER
OPERATIONS DIRECTOR COO
OPERATIONS DIRECTOR DIRECTOR OF OPERATIONS
OPERATIONS MANAGER COO
OPERATIONS MANAGER OPERATIONS MANAGER
REGIONAL DIRECTOR BOARD MEMBER
REGIONAL DIRECTOR REGIONAL DIRECTOR
REGIONAL VICE PRESIDENT BOARD VICE PRESIDENT
REGIONAL VICE PRESIDENT REGIONAL VICE PRESIDENT
RVP BOARD VICE PRESIDENT
RVP REGIONAL VICE PRESIDENT
V CHAIR BOARD CHAIR
V CHAIR BOARD VICE PRESIDENT
VICE PRESIDENT OF OPERATIONS COO
VICE PRESIDENT OF OPERATIONS VICE PRESIDENT OF OPERATIONS
VICE PRESIDENT OF PROGRAMS PROGRAMS VICE PRESIDENT
VICE PRESIDENT OF PROGRAMS VICE PRESIDENT OF PROGRAMS

Why are these cases uncoded?

keep <- c("PNAME", "tot.hours", "tot.comp", "title.raw", 
          "title.standard", "title.v6", "title.v6" )
  
df[ is.na( df$title.standard) , keep ]

Spacing is problematic - causing merges to break on some cases like PRESIDENT & CEO.

df2 <- df[ is.na( df$title.standard ) , keep ]
df2$TitleTxt6
## NULL
#  [1] "VICE PRESIDENT OF COMMUNICATIONS LEADERSHIP" "BOARD MEMBER TREASURER AND CFO"             
#  [3] "VICE PRESIDENT OF SHARED SERVICES"           "CHIEF MEDICAL OFFICER"                      
#  [5] " EXECUTIVE"                                  " EXECUTIVE"                                 
#  [7] "SHOW ANNOUNCER"                              "PRESIDENT "                                 
#  [9] "ASSISTANT VICE PRESIDENT BASEBALL"           "BOARD OF DIRECTOR"                          
# [11] " PRESIDENT"                                  " CEO"                                       
# [13] "  PRESIDENT"                                 "BOARD CHAIR "                               
# [15] "HH  CHAIR"                                   "RECORDER"                                   
# [17] "DIRECTOR "                                   " PRESIDENT"                                 
# [19] "STEWARD" 
  • add chief medical officer
  • how to code “executive”
    • CEO?
    • if CEO exists, code as VP?
  • create generic “VICE PRESIDENT OF” category for cases that don’t fit into current domains?
    • many of these just need title variants: VP OF COMMS & LEADERSHIP –> VP OF COMMS

01 Standardize DF

df <- 
  df.sample %>% 
  standardize_df() 
## [1] "standardize df step complete"
head( df )

02 Remove Dates

Many titles include dates to signify the start or end of the period that a person holds a position. For example:

  • Executive Director until March 2020
  • Executive Director starting March 2020
remove_dates()
¦
+-- convert_ordinal()
+-- identify_date() 
+-- remove_date()

Function Demo

df <- 
  df.sample %>% 
  standardize_df() %>% 
  remove_dates() 
## [1] "standardize df step complete"
## [1] "remove dates step complete"
paste0( df$TITLE_RAW, "    -->    ", df$TitleTxt2 ) %>% knitr::kable()
x
DIRECTOR (UNTIL 11/18) –> DIRECTOR UNTIL
SECRETARY –> SECRETARY
Secretary –> SECRETARY
BOARD MEMBER –> BOARD MEMBER
DIRECTOR –> DIRECTOR
DIRECTOR –> DIRECTOR
BOARD OF DIRECTOR –> BOARD OF DIRECTOR
DIRECTOR –> DIRECTOR
Secretary –> SECRETARY
Director –> DIRECTOR
Director –> DIRECTOR
CHAIRMAN –> CHAIRMAN
Director –> DIRECTOR
DIRECTOR –> DIRECTOR
NATIONAL EXECUTIVE DIRECTOR –> NATIONAL EXECUTIVE DIRECTOR
PAST PRESIDENT –> PAST PRESIDENT
DIRECTOR –> DIRECTOR
Treasurer –> TREASURER
CHAIRMAN –> CHAIRMAN
DIRECTOR –> DIRECTOR
DIRECTOR –> DIRECTOR
DIRECTOR –> DIRECTOR
Board Member Treasurer and CFO –> BOARD MEMBER TREASURER AND CFO
TREASURER/CH –> TREASURER/CH
ASST. VP - BASEBALL –> ASST VP - BASEBALL
Director –> DIRECTOR
BOARD MEMBER –> BOARD MEMBER
VICE CHAIRMAN/DIRECTOR –> VICE CHAIRMAN/DIRECTOR
TRUSTEE –> TRUSTEE
ASST. SEC/TR –> ASST SEC/TR
DIRECTOR –> DIRECTOR
TREASURER –> TREASURER
SECRETARY –> SECRETARY
DIRECTOR –> DIRECTOR
Secretary-Treas –> SECRETARY-TREAS
SECRETARY/TR –> SECRETARY/TR
VICE PRESIDENT –> VICE PRESIDENT
PRESIDENT –> PRESIDENT
Secretary –> SECRETARY
President –> PRESIDENT
BOARD MEMBER –> BOARD MEMBER
DIRECTOR –> DIRECTOR
TRUSTEE –> TRUSTEE
Vice President –> VICE PRESIDENT
Director –> DIRECTOR
STEWARD –> STEWARD
VP OF COMMUNITY LEADERSHIP –> VP OF COMMUNITY LEADERSHIP
FORMER EXEC. –> FORMER EXEC
SECRETARY –> SECRETARY
DIRECTOR –> DIRECTOR
DIRECTOR –> DIRECTOR
DIRECTOR –> DIRECTOR
DIRECTOR –> DIRECTOR
BOARD MEMBER –> BOARD MEMBER
BOARD MEMBER –> BOARD MEMBER
EXECUTIVE VICE PRESIDENT –> EXECUTIVE VICE PRESIDENT
DIRECTOR –> DIRECTOR
BOARD MEMBER –> BOARD MEMBER
VP OF SHARED SERVICES –> VP OF SHARED SERVICES
BOARD CHAIR (through 09/18) –> BOARD CHAIR THROUGH
IMMEDIATE PAST PRESIDENT –> IMMEDIATE PAST PRESIDENT
DIRECTOR –> DIRECTOR
Treasurer –> TREASURER
Director –> DIRECTOR
President –> PRESIDENT
DIRECTOR –> DIRECTOR
Director –> DIRECTOR
PRESIDENT –> PRESIDENT
DIRECTOR –> DIRECTOR
BOARD CHAIR –> BOARD CHAIR
VICE PRESIDENT –> VICE PRESIDENT
CHIEF EXECUTIVE OFFICER/CHIEF MEDICAL OFFICER –> CHIEF EXECUTIVE OFFICER/CHIEF MEDICAL OFFICER
Vice President –> VICE PRESIDENT
BOARD MEMBER –> BOARD MEMBER
BOARD MEMBER –> BOARD MEMBER
RECORDER –> RECORDER
Secretary –> SECRETARY
Show Announcer –> SHOW ANNOUNCER
Director –> DIRECTOR
INTERIM EXEC –> INTERIM EXEC
President –> PRESIDENT
SGT AT ARMS –> SGT AT ARMS
PRESIDENT –> PRESIDENT
DIRECTOR –> DIRECTOR
SECRETARY –> SECRETARY
Director –> DIRECTOR
TREASURER –> TREASURER
PRESIDENT –> PRESIDENT
BOARD MEMBER –> BOARD MEMBER
EXECUTIVE VICE PRESIDENT –> EXECUTIVE VICE PRESIDENT
Board Chairman –> BOARD CHAIRMAN
HH DISTRICT CHAIRMAN –> HH DISTRICT CHAIRMAN
PAST PRESIDE –> PAST PRESIDE
VICE PRESIDE –> VICE PRESIDE
CHAIRMAN –> CHAIRMAN
PRESIDENT ELECT –> PRESIDENT ELECT
Trustee –> TRUSTEE
DIRECTOR –> DIRECTOR
TRUSTEE –> TRUSTEE
Executive Director –> EXECUTIVE DIRECTOR

03 Standardize Conjugations

Convert all conjugations within titles to consistent formats.

standardize_conj( title = "TitleTxt2" )
¦
+-- standardize_and()
+-- standardize_to()
+-- standardize_of()
+-- standardize_comma()
+-- standardize_slash()
+-- standardize_and() #repeated bc of possible standardization changes
+-- standardize_separator()

Function Demo

df <- 
  df.sample %>% 
  standardize_df() %>% 
  remove_dates() %>% 
  standardize_conj() 
## [1] "standardize df step complete"
## [1] "remove dates step complete"
## [1] "standardize conjunctions step complete"
df[ c("TITLE_RAW","TitleTxt2","TitleTxt3") ]  %>% knitr::kable()
TITLE_RAW TitleTxt2 TitleTxt3
DIRECTOR (UNTIL 11/18) DIRECTOR UNTIL DIRECTOR UNTIL
SECRETARY SECRETARY SECRETARY
Secretary SECRETARY SECRETARY
BOARD MEMBER BOARD MEMBER BOARD MEMBER
DIRECTOR DIRECTOR DIRECTOR
DIRECTOR DIRECTOR DIRECTOR
BOARD OF DIRECTOR BOARD OF DIRECTOR BOARD OF DIRECTOR
DIRECTOR DIRECTOR DIRECTOR
Secretary SECRETARY SECRETARY
Director DIRECTOR DIRECTOR
Director DIRECTOR DIRECTOR
CHAIRMAN CHAIRMAN CHAIRMAN
Director DIRECTOR DIRECTOR
DIRECTOR DIRECTOR DIRECTOR
NATIONAL EXECUTIVE DIRECTOR NATIONAL EXECUTIVE DIRECTOR NATIONAL EXECUTIVE DIRECTOR
PAST PRESIDENT PAST PRESIDENT PAST PRESIDENT
DIRECTOR DIRECTOR DIRECTOR
Treasurer TREASURER TREASURER
CHAIRMAN CHAIRMAN CHAIRMAN
DIRECTOR DIRECTOR DIRECTOR
DIRECTOR DIRECTOR DIRECTOR
DIRECTOR DIRECTOR DIRECTOR
Board Member Treasurer and CFO BOARD MEMBER TREASURER AND CFO BOARD MEMBER TREASURER & CFO
TREASURER/CH TREASURER/CH TREASURER & CH
ASST. VP - BASEBALL ASST VP - BASEBALL ASST VP BASEBALL
Director DIRECTOR DIRECTOR
BOARD MEMBER BOARD MEMBER BOARD MEMBER
VICE CHAIRMAN/DIRECTOR VICE CHAIRMAN/DIRECTOR VICE CHAIRMAN & DIRECTOR
TRUSTEE TRUSTEE TRUSTEE
ASST. SEC/TR ASST SEC/TR ASST SEC & TR
DIRECTOR DIRECTOR DIRECTOR
TREASURER TREASURER TREASURER
SECRETARY SECRETARY SECRETARY
DIRECTOR DIRECTOR DIRECTOR
Secretary-Treas SECRETARY-TREAS SECRETARY-TREAS
SECRETARY/TR SECRETARY/TR SECRETARY & TR
VICE PRESIDENT VICE PRESIDENT VICE PRESIDENT
PRESIDENT PRESIDENT PRESIDENT
Secretary SECRETARY SECRETARY
President PRESIDENT PRESIDENT
BOARD MEMBER BOARD MEMBER BOARD MEMBER
DIRECTOR DIRECTOR DIRECTOR
TRUSTEE TRUSTEE TRUSTEE
Vice President VICE PRESIDENT VICE PRESIDENT
Director DIRECTOR DIRECTOR
STEWARD STEWARD STEWARD
VP OF COMMUNITY LEADERSHIP VP OF COMMUNITY LEADERSHIP VP OF COMMUNITY LEADERSHIP
FORMER EXEC. FORMER EXEC FORMER EXEC
SECRETARY SECRETARY SECRETARY
DIRECTOR DIRECTOR DIRECTOR
DIRECTOR DIRECTOR DIRECTOR
DIRECTOR DIRECTOR DIRECTOR
DIRECTOR DIRECTOR DIRECTOR
BOARD MEMBER BOARD MEMBER BOARD MEMBER
BOARD MEMBER BOARD MEMBER BOARD MEMBER
EXECUTIVE VICE PRESIDENT EXECUTIVE VICE PRESIDENT EXECUTIVE VICE PRESIDENT
DIRECTOR DIRECTOR DIRECTOR
BOARD MEMBER BOARD MEMBER BOARD MEMBER
VP OF SHARED SERVICES VP OF SHARED SERVICES VP OF SHARED SERVICES
BOARD CHAIR (through 09/18) BOARD CHAIR THROUGH BOARD CHAIR THROUGH
IMMEDIATE PAST PRESIDENT IMMEDIATE PAST PRESIDENT IMMEDIATE PAST PRESIDENT
DIRECTOR DIRECTOR DIRECTOR
Treasurer TREASURER TREASURER
Director DIRECTOR DIRECTOR
President PRESIDENT PRESIDENT
DIRECTOR DIRECTOR DIRECTOR
Director DIRECTOR DIRECTOR
PRESIDENT PRESIDENT PRESIDENT
DIRECTOR DIRECTOR DIRECTOR
BOARD CHAIR BOARD CHAIR BOARD CHAIR
VICE PRESIDENT VICE PRESIDENT VICE PRESIDENT
CHIEF EXECUTIVE OFFICER/CHIEF MEDICAL OFFICER CHIEF EXECUTIVE OFFICER/CHIEF MEDICAL OFFICER CHIEF EXECUTIVE OFFICER & CHIEF MEDICAL OFFICER
Vice President VICE PRESIDENT VICE PRESIDENT
BOARD MEMBER BOARD MEMBER BOARD MEMBER
BOARD MEMBER BOARD MEMBER BOARD MEMBER
RECORDER RECORDER RECORDER
Secretary SECRETARY SECRETARY
Show Announcer SHOW ANNOUNCER SHOW ANNOUNCER
Director DIRECTOR DIRECTOR
INTERIM EXEC INTERIM EXEC INTERIM EXEC
President PRESIDENT PRESIDENT
SGT AT ARMS SGT AT ARMS SGT AT ARMS
PRESIDENT PRESIDENT PRESIDENT
DIRECTOR DIRECTOR DIRECTOR
SECRETARY SECRETARY SECRETARY
Director DIRECTOR DIRECTOR
TREASURER TREASURER TREASURER
PRESIDENT PRESIDENT PRESIDENT
BOARD MEMBER BOARD MEMBER BOARD MEMBER
EXECUTIVE VICE PRESIDENT EXECUTIVE VICE PRESIDENT EXECUTIVE VICE PRESIDENT
Board Chairman BOARD CHAIRMAN BOARD CHAIRMAN
HH DISTRICT CHAIRMAN HH DISTRICT CHAIRMAN HH DISTRICT CHAIRMAN
PAST PRESIDE PAST PRESIDE PAST PRESIDE
VICE PRESIDE VICE PRESIDE VICE PRESIDE
CHAIRMAN CHAIRMAN CHAIRMAN
PRESIDENT ELECT PRESIDENT ELECT PRESIDENT ELECT
Trustee TRUSTEE TRUSTEE
DIRECTOR DIRECTOR DIRECTOR
TRUSTEE TRUSTEE TRUSTEE
Executive Director EXECUTIVE DIRECTOR EXECUTIVE DIRECTOR

04 Split Titles

split_titles()
¦
+-- split_title()
¦
¦   +-- unmingle()
¦
¦       +-- check_hunspell()
¦       +-- standardize_space()

Function Demo

df <- 
  df.sample %>% 
  standardize_df() %>% 
  remove_dates() %>% 
  standardize_conj() %>%  
  split_titles() 
## [1] "standardize df step complete"
## [1] "remove dates step complete"
## [1] "standardize conjunctions step complete"
## [1] "split titles step complete"
df[ c("F9_07_COMP_DTK_NAME_PERS","Num.Titles","TITLE_RAW","TitleTxt4","TitleTxt3") ] %>% 
  dplyr::rename( PNAME = F9_07_COMP_DTK_NAME_PERS ) %>%
  group_by( PNAME ) %>% 
  mutate( N=max(Num.Titles) ) %>% 
  dplyr::arrange( desc(N), PNAME )

05 Standardize Spelling

[ change “substitute_” to “fix_” in function names ? ]

standardize_spelling()
¦
+-- fix_spelling()
¦   +-- substitute_vice()  
¦   +-- substitute_executive(TitleTxt)
¦   +-- substitute_director(TitleTxt)
¦   +-- substitute_operations(TitleTxt)
¦   +-- substitute_assistant(TitleTxt)
¦   +-- substitute_president(TitleTxt)
¦   +-- substitute_secretary(TitleTxt)
¦   +-- substitute_treasurer(TitleTxt)
¦   +-- substitute_finance(TitleTxt)
¦   +-- substitute_senior(TitleTxt)
¦   +-- substitute_development(TitleTxt)
¦   +-- substitute_chair(TitleTxt)
¦   +-- substitute_officer(TitleTxt)
¦   +-- substitute_admin(TitleTxt)
¦   +-- substitute_coordinator(TitleTxt)
¦   +-- substitute_strategy(TitleTxt) #strategy/strategic
¦   +-- substitute_hr(TitleTxt)
¦   +-- substitute_manage(TitleTxt) #management/managing/manager
¦   +-- substitute_programs(TitleTxt) #programs/programming
¦   +-- substitute_projects(TitleTxt)
¦   +-- substitute_public(TitleTxt)
¦   +-- substitute_business(TitleTxt)
¦   +-- substitute_comm(TitleTxt) #communication/committee
¦   +-- substitute_information(TitleTxt)
¦   +-- substitute_intelligence(TitleTxt)
¦   +-- substitute_technology(TitleTxt)
¦   +-- substitute_institute(TitleTxt) #institute/institutional
¦   +-- substitute_academics(TitleTxt) #academics/academy
¦   +-- substitute_marketing(TitleTxt)
¦   +-- substitute_advancement(TitleTxt)
¦   +-- substitute_philanthropy(TitleTxt)
¦   +-- substitute_systems(TitleTxt)
¦   +-- substitute_general(TitleTxt)
¦   +-- substitute_planning(TitleTxt) #planning/planned
¦   +-- substitute_compliance(TitleTxt)
¦   +-- substitute_enrollment(TitleTxt)
¦   +-- substitute_admissions(TitleTxt)
¦   +-- substitute_deputy(TitleTxt)
¦   +-- substitute_corresponding(TitleTxt) #correspondent/corresponding
¦   +-- substitute_emeritus(TitleTxt)
¦   +-- substitute_relations(TitleTxt)
¦   +-- substitute_representative(TitleTxt)
¦   +-- substitute_board(TitleTxt)
¦   +-- substitute_transportation(TitleTxt)
¦   +-- substitute_exofficio(TitleTxt)
¦   +-- substitute_atlarge(TitleTxt)
¦   +-- substitute_member(TitleTxt)
¦   +-- substitute_governor(TitleTxt)
¦   +-- condense_abbreviations(TitleTxt) 
¦   +-- substitute_miscellaneous(TitleTxt)
¦   +-- remove_trailing_conjunctions(TitleTxt)
¦   +-- stand_titles(TitleTxt)
¦   +-- fix_of(TitleTxt)
¦   +-- spellcheck(TitleTxt) #slows things down, but is useful

Function Demo

df <- 
  df.sample %>% 
  standardize_df() %>% 
  remove_dates() %>% 
  standardize_conj() %>%   
  split_titles() %>% 
  standardize_spelling() 
## [1] "standardize df step complete"
## [1] "remove dates step complete"
## [1] "standardize conjunctions step complete"
## [1] "split titles step complete"
## [1] "standardize spelling step complete"
df[ c("TitleTxt4","TitleTxt5") ] %>% 
  filter( TitleTxt4 != TitleTxt5 )

NOTES:

VICE PRESIDENT MEMBERSHIP –> VICE PRESIDENT OF MEMBERSHPIP? VICE PRESIDENT VOLUNTEERS –> VICE PRESIDENT OF VOLUNTEERS?

Inspect all to see if there are cases where spelling was not corrected and it should be.

df[ c("TitleTxt4","TitleTxt5") ] %>% knitr::kable() 
TitleTxt4 TitleTxt5
1 DIRECTOR UNTIL DIRECTOR UNTIL
2 SECRETARY SECRETARY
3 SECRETARY SECRETARY
4 BOARD MEMBER BOARD MEMBER
5 DIRECTOR DIRECTOR
6 DIRECTOR DIRECTOR
7 BOARD OF DIRECTOR BOARD OF DIRECTOR
8 DIRECTOR DIRECTOR
9 SECRETARY SECRETARY
10 DIRECTOR DIRECTOR
11 DIRECTOR DIRECTOR
12 CHAIRMAN CHAIR
13 DIRECTOR DIRECTOR
14 DIRECTOR DIRECTOR
15 NATIONAL EXECUTIVE DIRECTOR NATIONAL CEO
16 PAST PRESIDENT PAST PRESIDENT
17 DIRECTOR DIRECTOR
18 TREASURER TREASURER
19 CHAIRMAN CHAIR
20 DIRECTOR DIRECTOR
21 DIRECTOR DIRECTOR
22 DIRECTOR DIRECTOR
23 BOARD MEMBER TREASURER BOARD MEMBER TREASURER
23.1 CFO CFO
24 TREASURER TREASURER
24.1 CH CHAIR
25 ASST VP BASEBALL ASSISTANT VICE PRESIDENT BASEBALL
26 DIRECTOR DIRECTOR
27 BOARD MEMBER BOARD MEMBER
28 VICE CHAIRMAN VICE CHAIR
28.1 DIRECTOR DIRECTOR
29 TRUSTEE TRUSTEE
30 ASST SEC ASSISTANT SECRETARY
30.1 TR TREASURER
31 DIRECTOR DIRECTOR
32 TREASURER TREASURER
33 SECRETARY SECRETARY
34 DIRECTOR DIRECTOR
35 SECRETARY-TREAS SECRETARY-TREASURER
36 SECRETARY SECRETARY
36.1 TR TREASURER
37 VICE PRESIDENT VICE PRESIDENT
38 PRESIDENT PRESIDENT
39 SECRETARY SECRETARY
40 PRESIDENT PRESIDENT
41 BOARD MEMBER BOARD MEMBER
42 DIRECTOR DIRECTOR
43 TRUSTEE TRUSTEE
44 VICE PRESIDENT VICE PRESIDENT
45 DIRECTOR DIRECTOR
46 STEWARD STEWARD
47 VP OF COMMUNITY LEADERSHIP VICE PRESIDENT OF COMMUNITY LEADERSHIP
48 FORMER EXEC FORMER EXECUTIVE
49 SECRETARY SECRETARY
50 DIRECTOR DIRECTOR
51 DIRECTOR DIRECTOR
52 DIRECTOR DIRECTOR
53 DIRECTOR DIRECTOR
54 BOARD MEMBER BOARD MEMBER
55 BOARD MEMBER BOARD MEMBER
56 EXECUTIVE VICE PRESIDENT EXECUTIVE VICE PRESIDENT
57 DIRECTOR DIRECTOR
58 BOARD MEMBER BOARD MEMBER
59 VP OF SHARED SERVICES VICE PRESIDENT OF SHARED SERVICES
60 BOARD CHAIR THROUGH BOARD CHAIR THROUGH
61 IMMEDIATE PAST PRESIDENT IMMEDIATE PAST PRESIDENT
62 DIRECTOR DIRECTOR
63 TREASURER TREASURER
64 DIRECTOR DIRECTOR
65 PRESIDENT PRESIDENT
66 DIRECTOR DIRECTOR
67 DIRECTOR DIRECTOR
68 PRESIDENT PRESIDENT
69 DIRECTOR DIRECTOR
70 BOARD CHAIR BOARD CHAIR
71 VICE PRESIDENT VICE PRESIDENT
72 CHIEF EXECUTIVE OFFICER CEO
72.1 CHIEF MEDICAL OFFICER CHIEF MEDICAL OFFICER
73 VICE PRESIDENT VICE PRESIDENT
74 BOARD MEMBER BOARD MEMBER
75 BOARD MEMBER BOARD MEMBER
76 RECORDER RECORDER
77 SECRETARY SECRETARY
78 SHOW ANNOUNCER SHOW ANNOUNCER
79 DIRECTOR DIRECTOR
80 INTERIM EXEC INTERIM EXECUTIVE
81 PRESIDENT PRESIDENT
82 SGT AT ARMS SERGEANT AT ARMS
83 PRESIDENT PRESIDENT
84 DIRECTOR DIRECTOR
85 SECRETARY SECRETARY
86 DIRECTOR DIRECTOR
87 TREASURER TREASURER
88 PRESIDENT PRESIDENT
89 BOARD MEMBER BOARD MEMBER
90 EXECUTIVE VICE PRESIDENT EXECUTIVE VICE PRESIDENT
91 BOARD CHAIRMAN BOARD CHAIR
92 HH DISTRICT CHAIRMAN HH DISTRICT CHAIR
93 PAST PRESIDE PAST PRESIDENT
94 VICE PRESIDE VICE PRESIDENT
95 CHAIRMAN CHAIR
96 PRESIDENT ELECT PRESIDENT ELECT
97 TRUSTEE TRUSTEE
98 DIRECTOR DIRECTOR
99 TRUSTEE TRUSTEE
100 EXECUTIVE DIRECTOR CEO

06 Generate Status Codes

gen_status_codes()
¦
+-- categorize_miscellaneous()
¦   
¦   +-- identify_sched_o()
¦   +-- identify_at_large()
¦   +-- identify_as_needed()
¦   +-- identify_ex_officio
¦   +-- identify_co
¦   +-- filter_regional
¦   +-- remove_numbers
¦
+-- categorize_qualifiers()
¦   +-- standardize_qualifiers()  

Assets

# print lists of words
# used for status mapping 
former-words.rda: 
future-words.rda: 
interim-words.rda: 
regional-words.rda: 
possible-regional-list.rda: 

# is this used still?   
status-mapping.rda:

Function Demo

df <- 
  df.sample %>% 
  standardize_df() %>% 
  remove_dates() %>% 
  standardize_conj() %>%   
  split_titles() %>% 
  standardize_spelling() %>%   
  gen_status_codes() 
## [1] "standardize df step complete"
## [1] "remove dates step complete"
## [1] "standardize conjunctions step complete"
## [1] "split titles step complete"
## [1] "standardize spelling step complete"
## [1] "generate status codes step complete"
df[ c("TitleTxt5","TitleTxt6") ]

Status codes:

df$REGIONAL.X    <- as.numeric( df$REGIONAL.X)
df$FUTURE.X      <- as.numeric( df$FUTURE.X )
df$FORMER.X      <- as.numeric( df$FORMER.X )
df$INTERIM.X     <- as.numeric( df$INTERIM.X )
df$OUTGOING.X     <- as.numeric( df$OUTGOING.X )
  
keep <- 
  c("Num.Titles", "TITLE_RAW", "TitleTxt6", 
    "DATE.X", "SCHED.O.X", "AT.LARGE.X",
    "EXOFFICIO.X", "CO.X",  
    "REGIONAL.X",  "FUTURE.X", "FORMER.X", "INTERIM.X", "OUTGOING.X")

df[ keep ]

NOTES:

Make all status code boolean.

07 Standardize Titles

Map all of the current titles onto a standardized list.

standardize_titles()

Assets

googlesheets4::gs4_deauth()
df.standard <- googlesheets4::read_sheet( "1iYEY2HYDZTV0uvu35UuwdgAUQNKXSyab260pPPutP1M", 
                                          sheet="title-standardization", range="A:B",
                                          col_types = "c" )  # c = character

df.standard <- df.standard[ c("title.standard","title.variant") ] 

df.standard %>% 
  unique( ) %>% 
  arrange( title.standard, title.variant )

Function Demo

df <- 
  df.sample %>% 
  standardize_df() %>% 
  remove_dates() %>% 
  standardize_conj() %>%   # condition has length > 1
  split_titles() %>% 
  standardize_spelling() %>%   # condition has length > 1
  gen_status_codes() %>% 
  standardize_titles() 
## [1] "standardize df step complete"
## [1] "remove dates step complete"
## [1] "standardize conjunctions step complete"
## [1] "split titles step complete"
## [1] "standardize spelling step complete"
## [1] "generate status codes step complete"
## [1] "standardize titles step complete"
df[ c("TITLE_RAW","TitleTxt6","title.standard")]

08 Categorize Titles

Apply the category codes for domain and status hierarchies to the standardized versions of the title.

categorize_titles()

Function Demo

df <- 
  df.sample %>% 
  standardize_df() %>% 
  remove_dates() %>% 
  standardize_conj() %>%   # condition has length > 1
  split_titles() %>% 
  standardize_spelling() %>%   # condition has length > 1
  gen_status_codes() %>% 
  standardize_titles() %>%
  categorize_titles()
## [1] "standardize df step complete"
## [1] "remove dates step complete"
## [1] "standardize conjunctions step complete"
## [1] "split titles step complete"
## [1] "standardize spelling step complete"
## [1] "generate status codes step complete"
## [1] "standardize titles step complete"
## [1] "categorize titles step complete"
reorder_df( df )

09 Conditional Logic

Applies conditional logic to further refine the dataset. For example, checks organizations with multiple CEOs if they are actually multiple CEOs or if it’s a coding mistake. Uses suborganization categories (e.g. pay rank, hours worked) to determine if recategorization is necessary.

conditional_logic()
¦
+-- clean_up_ceos()
+-- director_correction()

Function Demo

df <- 
  df.sample %>% 
  standardize_df() %>% 
  remove_dates() %>% 
  standardize_conj() %>%   # condition has length > 1
  split_titles() %>% 
  standardize_spelling() %>%   # condition has length > 1
  gen_status_codes() %>% 
  standardize_titles() %>%
  categorize_titles() %>%
  conditional_logic()
## [1] "standardize df step complete"
## [1] "remove dates step complete"
## [1] "standardize conjunctions step complete"
## [1] "split titles step complete"
## [1] "standardize spelling step complete"
## [1] "generate status codes step complete"
## [1] "standardize titles step complete"
## [1] "categorize titles step complete"
## [1] "conditional logic step complete"
reorder_df( df )

10 Generate Taxonomy

Not fully implemented yet in this package, but this is similarly linked to the Google Sheets data dictionary document.