Which #TidyTuesday post offices are in Hawaii?
By Julia Silge in rstats tidymodels
April 14, 2021
This is the latest in my series of
screencasts demonstrating how to use the
tidymodels packages, from starting out with first modeling steps to tuning more complex models. Todayβs screencast walks through how to use text information at the subword level in predictive modeling, with this weekβs
#TidyTuesday
dataset on United States post offices. π¬
Here is the code I used in the video, for those who prefer reading instead of or in addition to video.
Explore data
Our modeling goal is to predict whether a post office is in Hawaii or not based on the name of the office in this weekβs #TidyTuesday dataset.
Letβs start by reading in the data.
library(tidyverse)
post_offices <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-04-13/post_offices.csv")
How many post offices are there in each state?
post_offices %>%
count(state, sort = TRUE)
## # A tibble: 53 x 2
## state n
## <chr> <int>
## 1 PA 8534
## 2 TX 7772
## 3 KY 7432
## 4 VA 7085
## 5 NC 6547
## 6 MO 6232
## 7 NY 6111
## 8 TN 6025
## 9 GA 5754
## 10 OH 5537
## # β¦ with 43 more rows
In Hawaii, the names of the post offices are unique, but there are not many at all, compared to the other states.
post_offices %>%
filter(state == "HI") %>%
pull(name)
## [1] "AIEA" "ANAHOLA" "CANTON ISLAND"
## [4] "CAPTAIN COOK" "COCONUT ISLAND" "ELEELE"
## [7] "EWA" "EWA BEACH" "FERNDALE"
## [10] "FORT KAMEHAMEHA" "FORT SHAFTER" "GLENWOOD"
## [13] "HAIKU" "HAINA" "HAIULA"
## [16] "HAKALAU" "HALAULA" "HALAWA"
## [19] "HALEIWA" "HALEIWA" "HALIIMAILE RURAL BR."
## [22] "HAMAKUA" "HAMAKUAPOKO" "HAMOA"
## [25] "HANA" "HANALEI" "HANAMAULU"
## [28] "HANAPEPE" "HATANLA" "HAUULA"
## [31] "HAUULA" "HAWAII NATIONAL PARK" "HAWI"
## [34] "HEEIA" "HICKAM FIELD BR." "HILEA"
## [37] "HILO" "HOLUALOA" "HOMESTEAD"
## [40] "HONALUA" "HONAUNAU" "HONOKAA"
## [43] "HONOKOHAU" "HONOKOHUA" "HONOLUA"
## [46] "HONOLULU" "HONOMU" "HONOULIULI"
## [49] "HONUAPO" "HOOKENA" "HOOLEHUA"
## [52] "HOOPULOA" "HUEHUE" "HUELO"
## [55] "KAAAWA" "KAALAEA" "KAANAPALI"
## [58] "KAHAKULOA" "KAHANA" "KAHUKU"
## [61] "KAHULUI" "KAI MALINO" "KAILUA"
## [64] "KAILUA KONA" "KALAE" "KALAHEO"
## [67] "KALAPANA" "KALAPANA" "KALAUPAPA"
## [70] "KALAWAO" "KALOA" "KAMALO"
## [73] "KAMUELA" "KANEOHE" "KAPAA"
## [76] "KAPAA" "KAPAAU" "KAPOHO"
## [79] "KAPOLEI" "KAULAWAI" "KAUMAKANI"
## [82] "KAUNAKAKAI" "KAUPO" "KAWAIHAE"
## [85] "KAWAILOA" "KEAAU" "KEAAU"
## [88] "KEAHUA" "KEALAKEKUA" "KEALIA"
## [91] "KEANAE" "KEAUHOU" "KEKAHA"
## [94] "KEOKEA" "KEOMUKU" "KIHEI"
## [97] "KILAUEA" "KIPAHULU" "KOHALA"
## [100] "KOLOA" "KONA" "KUALAPUU"
## [103] "KUKUAIAU" "KUKUIHAELE" "KUNIA"
## [106] "KURTISTOWN" "LAHAINA" "LAIE"
## [109] "LALAMILO" "LANAI CITY" "LANIKAI"
## [112] "LAUPAHOEHOE" "LAWAI" "LIBBYVILLE"
## [115] "LIHUE" "LUKE FIELD BR." "MAHUKONA"
## [118] "MAKAWAO" "MAKAWELI" "MAKENA"
## [121] "MAKUA" "MANA" "MAUNA LOA"
## [124] "MAUNALOA" "MAUNAWEI" "MIDWAY ISLAND"
## [127] "MOUNTAINVIEW" "NAALEHU" "NAHIKU"
## [130] "NAHIKU" "NANAKULI" "NANAKULI"
## [133] "NAPOOPOO" "NINOLE" "OLAA"
## [136] "OLLA PLANTATION" "OLOWALU" "OOKALA"
## [139] "OPIHIKAO" "PAAUHAU" "PAAUILO"
## [142] "PAHALA" "PAHOA" "PAIA"
## [145] "PAPAALOA" "PAPAIKOU" "PAPAIKOU"
## [148] "PAUWELA" "PEAHI" "PEARL CITY"
## [151] "PEARL HARBOR" "PEARL HARBOR" "PELEKUNU"
## [154] "PEPEEKEO" "POHOKI" "PRINCEVILLE"
## [157] "PUHI" "PUKALANI" "PUKOO"
## [160] "PUKOO" "PUNALUU" "PUULOA HALE BR."
## [163] "PUUNENE" "ROOSEVELT" "SCHOFIELD BARRACKS"
## [166] "SPRECKELSVILLE" "SPRECKELSVILLE" "SUNSET BEACH"
## [169] "ULUPALAKUA" "ULUPALAKUA" "VOLCANO"
## [172] "VOLCANO HOUSE" "WAHIAWA" "WAIAHOLE"
## [175] "WAIAKOA" "WAIALEE" "WAIALUA"
## [178] "WAIANAE" "WAIHEE" "WAIHEE"
## [181] "WAIKANE" "WAIKAUPU" "WAIKOLOA"
## [184] "WAILUKU" "WAIMANALO" "WAIMEA"
## [187] "WAIMEA" "WAINIHA" "WAIOHINU"
## [190] "WAIPAHU" "WAIPIO" "WATERTOWN"
Build a model
We can start by loading the tidymodels metapackage, splitting our data into training and testing sets, and creating cross-validation samples. Think about this stage as spending your data budget.
library(tidymodels)
set.seed(123)
po_split <- post_offices %>%
mutate(state = case_when(
state == "HI" ~ "Hawaii",
TRUE ~ "Other"
)) %>%
select(name, state) %>%
initial_split(strate = state)
po_train <- training(po_split)
po_test <- testing(po_split)
set.seed(234)
po_folds <- vfold_cv(po_train, strata = state)
po_folds
## # 10-fold cross-validation using stratification
## # A tibble: 10 x 2
## splits id
## <list> <chr>
## 1 <split [112144/12461]> Fold01
## 2 <split [112144/12461]> Fold02
## 3 <split [112144/12461]> Fold03
## 4 <split [112144/12461]> Fold04
## 5 <split [112144/12461]> Fold05
## 6 <split [112145/12460]> Fold06
## 7 <split [112145/12460]> Fold07
## 8 <split [112145/12460]> Fold08
## 9 <split [112145/12460]> Fold09
## 10 <split [112145/12460]> Fold10
Next, letβs create our feature engineering recipe. Letβs tokenize using byte pair encoding; this is an algorithm that iteratively merges frequently occurring subword pairs and gets us information in between character-level and word-level. You can read more about byte pair encoding in this section of Supervised Machine Learning for Text Analysis in R.
library(textrecipes)
library(themis)
po_rec <- recipe(state ~ name, data = po_train) %>%
step_tokenize(name,
engine = "tokenizers.bpe",
training_options = list(vocab_size = 200)
) %>%
step_tokenfilter(name, max_tokens = 200) %>%
step_tf(name) %>%
step_normalize(all_predictors()) %>%
step_smote(state)
po_rec
## Data Recipe
##
## Inputs:
##
## role #variables
## outcome 1
## predictor 1
##
## Operations:
##
## Tokenization for name
## Text filtering for name
## Term frequency with name
## Centering and scaling for all_predictors()
## SMOTE based on state
We also are upsampling this very imbalanced data set via step_smote()
from the
themis package. The results of this data preprocessing show us the subword features.
po_rec %>%
prep() %>%
bake(new_data = NULL)
## # A tibble: 248,932 x 201
## `tf_name_-` tf_name_. `tf_name_'` `tf_name_'S` `tf_name_/` `tf_name_&`
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.0203 -0.0184 -0.0346 -0.165 -0.0102 -0.00567
## 2 -0.0203 102. -0.0346 -0.165 -0.0102 -0.00567
## 3 -0.0203 -0.0184 -0.0346 -0.165 -0.0102 -0.00567
## 4 -0.0203 -0.0184 -0.0346 -0.165 -0.0102 -0.00567
## 5 -0.0203 -0.0184 -0.0346 -0.165 -0.0102 -0.00567
## 6 -0.0203 -0.0184 -0.0346 -0.165 -0.0102 -0.00567
## 7 -0.0203 -0.0184 -0.0346 -0.165 -0.0102 -0.00567
## 8 -0.0203 -0.0184 -0.0346 -0.165 -0.0102 -0.00567
## 9 -0.0203 -0.0184 -0.0346 -0.165 -0.0102 -0.00567
## 10 -0.0203 -0.0184 -0.0346 -0.165 -0.0102 -0.00567
## # β¦ with 248,922 more rows, and 195 more variables: tf_name_` <dbl>,
## # tf_name_β <dbl>, tf_name_βA <dbl>, tf_name_βAL <dbl>, tf_name_βB <dbl>,
## # tf_name_βBE <dbl>, tf_name_βBL <dbl>, tf_name_βBR <dbl>, tf_name_βC <dbl>,
## # tf_name_βCENT <dbl>, tf_name_βCH <dbl>, tf_name_βCITY <dbl>,
## # tf_name_βCL <dbl>, tf_name_βCO <dbl>, tf_name_βCREEK <dbl>,
## # tf_name_βD <dbl>, tf_name_βE <dbl>, tf_name_βEL <dbl>, tf_name_βF <dbl>,
## # tf_name_βG <dbl>, tf_name_βGRO <dbl>, tf_name_βGROVE <dbl>,
## # tf_name_βH <dbl>, tf_name_βHILL <dbl>, tf_name_βJ <dbl>, tf_name_βK <dbl>,
## # tf_name_βL <dbl>, tf_name_βLAKE <dbl>, tf_name_βLE <dbl>, tf_name_βM <dbl>,
## # tf_name_βMILL <dbl>, tf_name_βMILLS <dbl>, tf_name_βMO <dbl>,
## # tf_name_βMOUNT <dbl>, tf_name_βN <dbl>, tf_name_βNEW <dbl>,
## # tf_name_βNOR <dbl>, tf_name_βO <dbl>, tf_name_βP <dbl>, tf_name_βPO <dbl>,
## # tf_name_βR <dbl>, tf_name_βRI <dbl>, tf_name_βRO <dbl>, tf_name_βS <dbl>,
## # tf_name_βSH <dbl>, tf_name_βSP <dbl>, tf_name_βSPRING <dbl>,
## # tf_name_βSPRINGS <dbl>, tf_name_βST <dbl>, tf_name_βT <dbl>,
## # tf_name_βV <dbl>, tf_name_βW <dbl>, tf_name_βWEST <dbl>, tf_name_βWH <dbl>,
## # tf_name_1 <dbl>, tf_name_A <dbl>, tf_name_AD <dbl>, tf_name_AG <dbl>,
## # tf_name_AK <dbl>, tf_name_AKE <dbl>, tf_name_AL <dbl>, tf_name_ALE <dbl>,
## # tf_name_ALL <dbl>, tf_name_AM <dbl>, tf_name_AN <dbl>, tf_name_AND <dbl>,
## # tf_name_ANT <dbl>, tf_name_AP <dbl>, tf_name_AR <dbl>, tf_name_ARD <dbl>,
## # tf_name_ARK <dbl>, tf_name_ART <dbl>, tf_name_AS <dbl>, tf_name_AST <dbl>,
## # tf_name_AT <dbl>, tf_name_ATION <dbl>, tf_name_AW <dbl>, tf_name_AY <dbl>,
## # tf_name_B <dbl>, tf_name_BER <dbl>, tf_name_BUR <dbl>, tf_name_BURG <dbl>,
## # tf_name_C <dbl>, tf_name_CE <dbl>, tf_name_CH <dbl>, tf_name_CK <dbl>,
## # tf_name_CO <dbl>, tf_name_D <dbl>, tf_name_E <dbl>, tf_name_ED <dbl>,
## # tf_name_EE <dbl>, tf_name_EL <dbl>, tf_name_ELD <dbl>, tf_name_ELL <dbl>,
## # tf_name_EN <dbl>, tf_name_ENT <dbl>, tf_name_ER <dbl>, tf_name_ERS <dbl>,
## # tf_name_ES <dbl>, tf_name_EST <dbl>, β¦
Next letβs create a model specification for a linear support vector machine. This is a newer model in parsnip, currently in the development version on GitHub. Linear SVMs are often a good starting choice for text models.
svm_spec <- svm_linear() %>%
set_mode("classification") %>%
set_engine("LiblineaR")
svm_spec
## Linear Support Vector Machine Specification (classification)
##
## Computational engine: LiblineaR
Letβs put these together in a workflow.
po_wf <- workflow() %>%
add_recipe(po_rec) %>%
add_model(svm_spec)
po_wf
## ββ Workflow ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
## Preprocessor: Recipe
## Model: svm_linear()
##
## ββ Preprocessor ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
## 5 Recipe Steps
##
## β step_tokenize()
## β step_tokenfilter()
## β step_tf()
## β step_normalize()
## β step_smote()
##
## ββ Model βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
## Linear Support Vector Machine Specification (classification)
##
## Computational engine: LiblineaR
Now letβs fit this workflow (that combines feature engineering with the SVM model) to the resamples we created earlier. The linear SVM model does not support class probabilities, so we need to set a custom metric_set()
that only includes metrics for hard clss probabilities.
set.seed(234)
doParallel::registerDoParallel()
po_rs <- fit_resamples(
po_wf,
po_folds,
metrics = metric_set(accuracy, sens, spec)
)
How did we do?
collect_metrics(po_rs)
## # A tibble: 3 x 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 accuracy binary 0.971 10 0.000763 Preprocessor1_Model1
## 2 sens binary 0.755 10 0.0386 Preprocessor1_Model1
## 3 spec binary 0.972 10 0.000791 Preprocessor1_Model1
Not too bad, although you can tell we are doing better on one class than the other.
Fit and evaluate final model
Next, letβs fit our model on last time to the whole training set at once (rather than resampled data) and evaluate on the testing set. This is the first time we have touched the testing set.
final_fitted <- last_fit(
po_wf,
po_split,
metrics = metric_set(accuracy, sens, spec)
)
collect_metrics(final_fitted)
## # A tibble: 3 x 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.969 Preprocessor1_Model1
## 2 sens binary 0.811 Preprocessor1_Model1
## 3 spec binary 0.969 Preprocessor1_Model1
Our performance on the testing set is about the same as what we found with our resampled data, which is good.
We can explore how the model is doing for both the positive and negative classes with a confusion matrix.
collect_predictions(final_fitted) %>%
conf_mat(state, .pred_class)
## Truth
## Prediction Hawaii Other
## Hawaii 43 1273
## Other 10 40209
This just really emphasizes what an imbalanced problem this is, but we can see how well we are doing for the post offices in Hawaii vs.Β the rest of the country.
We are still in the process of building out support for exploring results for the LiblineaR model (like a tidy()
method) but in the meantime, you can get out the linear coefficients manually.
po_fit <- pull_workflow_fit(final_fitted$.workflow[[1]])
liblinear_obj <- po_fit$fit$W
liblinear_df <- tibble(
term = colnames(liblinear_obj),
estimate = liblinear_obj[1, ]
)
liblinear_df
## # A tibble: 201 x 2
## term estimate
## <chr> <dbl>
## 1 tf_name_- 0.0814
## 2 tf_name_. -0.0664
## 3 tf_name_' 0.0661
## 4 tf_name_'S 0.495
## 5 tf_name_/ -0.00675
## 6 tf_name_& -0.0198
## 7 tf_name_` -0.00264
## 8 tf_name_β -0.0419
## 9 tf_name_βA -0.0868
## 10 tf_name_βAL 0.201
## # β¦ with 191 more rows
Those term
items are the subwords that the byte pair encoding algorithm found for this data set. We can arrange()
them to see which are most important in each direction.
liblinear_df %>%
arrange(-estimate)
## # A tibble: 201 x 2
## term estimate
## <chr> <dbl>
## 1 Bias 5.52
## 2 tf_name_G 0.990
## 3 tf_name_βD 0.975
## 4 tf_name_RI 0.820
## 5 tf_name_βT 0.696
## 6 tf_name_AND 0.643
## 7 tf_name_IR 0.578
## 8 tf_name_GH 0.561
## 9 tf_name_VER 0.561
## 10 tf_name_AS 0.559
## # β¦ with 191 more rows
liblinear_df %>%
arrange(estimate)
## # A tibble: 201 x 2
## term estimate
## <chr> <dbl>
## 1 tf_name_A -0.407
## 2 tf_name_I -0.339
## 3 tf_name_O -0.330
## 4 tf_name_AN -0.320
## 5 tf_name_βH -0.319
## 6 tf_name_βP -0.281
## 7 tf_name_ALE -0.280
## 8 tf_name_U -0.259
## 9 tf_name_OWN -0.258
## 10 tf_name_βF -0.222
## # β¦ with 191 more rows
Or we can build a visualization.
liblinear_df %>%
filter(term != "Bias") %>%
group_by(estimate > 0) %>%
slice_max(abs(estimate), n = 15) %>%
ungroup() %>%
mutate(term = str_remove(term, "tf_name_")) %>%
ggplot(aes(estimate, fct_reorder(term, estimate), fill = estimate > 0)) +
geom_col(alpha = 0.6) +
geom_text(aes(label = term), family = "IBMPlexSans-Medium") +
scale_fill_discrete(labels = c("More from Hawaii", "Less from Hawaii")) +
scale_y_discrete(breaks = NULL) +
theme(axis.text.y = element_blank()) +
labs(
x = "Coefficient from linear SVM",
y = NULL,
fill = NULL,
title = "Which subwords in a US Post Office name are used more in Hawaii?",
subtitle = "Subwords like A, I, O, and AN are the strongest predictors of a post office being in Hawaii"
)
- Posted on:
- April 14, 2021
- Length:
- 10 minute read, 2045 words
- Categories:
- rstats tidymodels
- Tags:
- rstats tidymodels