Topic modeling for #TidyTuesday Spice Girls lyrics

Learn how to train, explore, and understand an unsupervised topic model for text data.

Predicting viewership for #TidyTuesday Doctor Who episodes

Using a tidymodels workflow can make many modeling tasks more convenient, but sometimes you want more flexibility and control over how you handle your modeling objects. Learn how to handle resampled workflow results and extract the quantities you are interested in.

Spatial resampling for #TidyTuesday and the #30DayMapChallenge

Use spatial resampling to more accurately estimate model performance for geographic data.

Predict #TidyTuesday giant pumpkin weights with workflowsets

Get started with tidymodels workflowsets to handle and evaluate multiple preprocessing and modeling approaches simultaneously, using pumpkin competitions.

Multiclass predictive modeling for #TidyTuesday NBER papers

Tune and evaluate a multiclass model with lasso regularization for economics working papers.

Dimensionality reduction for #TidyTuesday Billboard Top 100 songs

Songs on the Billboard Top 100 have many audio features. We can use data preprocessing recipes to implement dimensionality reduction and understand how these features are related.

Fit and predict with tidymodels for #TidyTuesday bird baths in Australia

In this screencast, focus on some tidymodels basics such as how to put together feature engineering and a model algorithm, and how to fit and predict.

Modeling human/computer interactions on Star Trek from #TidyTuesday with workflowsets

Learn how to evaluate multiple feature engineering and modeling approaches with workflowsets, predicting whether a person or the computer spoke a line on Star Trek.

Predict housing prices in Austin TX with tidymodels and xgboost

More xgboost with tidymodels! Learn about feature engineering to incorporate text information as indicator variables for boosted trees.

Supervised Machine Learning for Text Analysis in R is now complete

Our new book in the Chapman & Hall/CRC Data Science Series is now complete and available for preorder!

Tune xgboost models with early stopping to predict shelter animal status

Early stopping can keep an xgboost model from overfitting.

Use racing methods to tune xgboost models and predict home runs

Models like xgboost have many tuning hyperparameters, but racing methods can help identify parameter combinations that are not performing well.

Predict which #TidyTuesday Scooby Doo monsters are REAL with a tuned decision tree model

Which Scooby Doo monsters are REAL?! Walk through how to tune and then choose a decision tree model, as well as how to visualize and evaluate the results.

Create a custom metric with tidymodels and NYC Airbnb prices

Predict prices for Airbnb listings in NYC with a data set from a recent episode of SLICED, with a focus on two specific aspects of this model analysis: creating a custom metric to evaluate the model and combining both tabular and unstructured text data in one model.

Class imbalance and classification metrics with aircraft wildlife strikes

Handling class imbalance in modeling affects classification metrics in different ways. Learn how to use tidymodels to subsample for class imbalance, and how to estimate model performance using resampling.

Partial dependence plots with tidymodels and DALEX for #TidyTuesday Mario Kart world records

Tune a decision tree model to predict whether a Mario Kart world record used a shortcut, and explore partial dependence profiles for the world record times.

Predict availability in #TidyTuesday water sources with random forest models

Walk through a tidymodels analysis from beginning to end to predict whether water is available at a water source in Sierra Leone.

Estimate change in #TidyTuesday CEO departures with bootstrap resampling

Are more CEO departures involuntary now than in the past? We can use tidymodels' bootstrap resampling and generalized linear models to understand change over time.

Which #TidyTuesday Netflix titles are movies and which are TV shows?

Use tidymodels to build features for modeling from Netflix description text, then fit and evaluate a support vector machine model.

Which #TidyTuesday post offices are in Hawaii?

Use tidymodels to predict post office location with subword features and a support vector machine model.

Dimensionality reduction of #TidyTuesday United Nations voting patterns

Explore country-level UN voting with a tidymodels approach to unsupervised machine learning.

Bootstrap confidence intervals for #TidyTuesday Super Bowl commercials

Estimate how commercial characteristics like humor and patriotic themes change with time using tidymodels functions for bootstrap confidence intervals.

Getting started with k-means and #TidyTuesday employment status

Use tidy data principles to understand which kinds of occupations are most similar in terms of demographic characteristics.

Understand your models with #TidyTuesday inequality in student debt

Explore results of models with convenient tidymodels functions.

Learn tidytext with my new learnr course

I am happy to announce that this free, open source, interactive course on text mining with tidy data principles is now published!

Explore art media over time in the #TidyTuesday Tate collection dataset

Check residuals and other model diagnostics for regression models trained on text features, all with tidymodels functions.

Predicting injuries for Chicago traffic crashes

Download up-to-date city data from Chicago’s open data portal and predict whether a traffic crash involved an injury with a bagged tree model.

Upcoming changes to tidytext: threat of COLLAPSE

The current development version of tidytext has changes that may affect your analyses.

Tune random forests for #TidyTuesday IKEA prices

Use tidymodels scaffolding functions for getting started quickly with commonly used models like random forests.

Tune and interpret decision trees for #TidyTuesday wind turbines

Use tidymodels to predict capacity for Canadian wind turbines with decision trees.

Predicting class membership for the #TidyTuesday Datasaurus Dozen

Which of the Datasaurus Dozen are easier or harder for a random forest model to identify? Learn how to use multiclass evaluation metrics to find out.

Modeling #TidyTuesday NCAA women’s basketball tournament seeds

Tune a hyperparameter and then understand how to choose the best value afterward, using tidymodels for modeling the relationship between expected wins and tournament seed.

Handle class imbalance in #TidyTuesday climbing expedition data with tidymodels

Use tidymodels for feature engineering steps like imputing missing data and subsampling for class imbalance, and build predictive models to predict the probability of survival for Himalayan climbers.

Introducing our new book, Tidy Modeling with R

An initial version of the first eleven chapters is available today! Look for more chapters to be released in the near future.

Train and analyze many models for #TidyTuesday crop yields

Learn how to use tidyverse and tidymodels functions to fit and analyze many models at once.

Build a #TidyTuesday predictive text model for The Last Airbender

Use text features and tidymodels to predict the speaker of individual lines from the show, and learn how to compute model-agnostic variable importance for any kind of model.

Get started with tidymodels and #TidyTuesday Palmer penguins

Build two kinds of classification models and evaluate them using resampling.

Supervised Machine Learning for Text Analysis in R

Announcing our new book, to be published in the Chapman & Hall/CRC Data Science Series!

Bagging with tidymodels and #TidyTuesday astronaut missions

Learn how to use bootstrap aggregating to predict the duration of astronaut missions.

The Bechdel test and the X-Mansion with tidymodels and #TidyTuesday

Explore data from the Claremont Run Project on Uncanny X-Men with bootstrap resampling.

Impute missing data for #TidyTuesday voyages of captive Africans with tidymodels

Understand more about the forced transport of African people using the Slave Voyages database.

PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes

Use tidymodels for unsupervised dimensionality reduction.

tidylo is now on CRAN! 🎉

Measure how the frequency of some feature differs across some group or set, using the weighted log odds.

Tune XGBoost with tidymodels and #TidyTuesday beach volleyball

Learn how to tune hyperparameters for an XGBoost classification model to predict wins and losses.

Learn tidymodels with my supervised machine learning course

I am happy to announce that a new version of my free, online, interactive course has been published!

Multinomial classification with tidymodels and #TidyTuesday volcano eruptions

Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast demonstrates how to implement multiclass or multinomial classification using this week’s #TidyTuesday dataset on volcanoes. 🌋 Here is the code I used in the video, for those who prefer reading instead of or in addition to video. Explore the data: our modeling goal is to predict the type of volcano from this week’s #TidyTuesday dataset based on other volcano characteristics like latitude, longitude, tectonic setting, etc.

Sentiment analysis with tidymodels and #TidyTuesday Animal Crossing reviews

A lot has been happening in the tidymodels ecosystem lately! There are many possible projects we on the tidymodels team could focus on next; we are interested in gathering community feedback to inform our priorities. If you are interested in sharing your opinion on next steps in tidymodels development, please take this short survey. Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models.

Modeling #TidyTuesday GDPR violations with tidymodels

This is an exciting week for us on the tidymodels team; we launched tidymodels.org, a new central location with resources and documentation for tidymodels packages. There is a TON to explore and learn there! 🚀 You can check out the official blog post for more details. Today, I’m publishing here on my blog another screencast demonstrating how to use tidymodels. This is a good video for folks getting started with tidymodels, using this week’s #TidyTuesday dataset on GDPR violations.

PCA and the #TidyTuesday best hip hop songs ever

Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m exploring a different part of the tidymodels framework; I’m showing how to implement principal component analysis via recipes with this week’s #TidyTuesday dataset on the best hip hop songs of all time as determined by a BBC poll of music critics. Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Bootstrap resampling with #TidyTuesday beer production data

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using this week’s #TidyTuesday dataset on beer production to show how to use bootstrap resampling to estimate model parameters. Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Tuning random forest hyperparameters with #TidyTuesday trees data

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using a #TidyTuesday dataset from earlier this year on trees around San Francisco to show how to tune the hyperparameters of a random forest model and then use the final best model. Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

LASSO regression using tidymodels and #TidyTuesday data for The Office

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using this week’s #TidyTuesday dataset on The Office to show how to build a lasso regression model and choose regularization parameters! Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Preprocessing and resampling using #TidyTuesday college data

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first getting started to how to tune machine learning models. Today, I’m using this week’s #TidyTuesday dataset on college tuition and diversity at US colleges to show some data preprocessing steps and how to use resampling! Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

Hyperparameter tuning and #TidyTuesday food consumption

Last week I published a screencast demonstrating how to use the tidymodels framework and specifically the recipes package. Today, I’m using this week’s #TidyTuesday dataset on food consumption around the world to show hyperparameter tuning! Here is the code I used in the video, for those who prefer reading instead of or in addition to video. Explore the data: our modeling goal here is to predict which countries are Asian countries and which countries are not, based on their patterns of food consumption in the eleven categories from the #TidyTuesday dataset.

#TidyTuesday hotel bookings and recipes

Last week I published my first screencast showing how to use the tidymodels framework for machine learning and modeling in R. Today, I’m using this week’s #TidyTuesday dataset on hotel bookings to show how to use one of the tidymodels packages, recipes, with some simple models! Here is the code I used in the video, for those who prefer reading instead of or in addition to video.

#TidyTuesday and tidymodels

This week I started my new job as a software engineer at RStudio, working with Max Kuhn and other folks on tidymodels. I am really excited about tidymodels because my own experience as a practicing data scientist has shown me some of the areas for growth that still exist in open source software when it comes to modeling and machine learning. Almost nothing has had the kind of dramatic impact on my productivity that the tidyverse and other RStudio investments have had; I am enthusiastic about contributing to that kind of user-focused transformation for modeling and machine learning.

Opioid prescribing habits in Texas

A paper I worked on was just published in a medical journal. This is quite an odd thing for me to be able to say, given my academic background and the career path I have had, but there you go! The first author of this paper is a long-time friend of mine working in anesthesiology and pain management, and he obtained data from the Texas Prescription Drug Monitoring Program (PDMP) about controlled substance prescriptions from April 2015 to 2018.

Text Mining in R: A Tidy Approach

I spoke on approaching text mining tasks using tidy data principles at rstudio::conf yesterday. I was so happy to have the opportunity to speak and the conference has been a great experience. If you want to catch up on what has been going on at rstudio::conf, Karl Broman put together a GitHub repo of slides and Sharon Machlis has been live-blogging the conference at Computerworld. A highlight for me was Andrew Flowers' talk on data journalism and storytelling; I don’t work in data journalism but I think I can apply almost everything he said to how I approach what I do.

A Beginner’s Guide to Travis-CI for R

The Blind Leading the Blind

Joy to the World, and also Anticipation, Disgust, Surprise…

In my previous blog post, I analyzed my Twitter archive and explored some aspects of my tweeting behavior. When do I tweet, how much do I retweet people, do I use hashtags? These are examples of one kind of question, but what about the actual verbal content of my tweets, the text itself? What kinds of questions can we ask and answer about the text in some programmatic way? This is what is called natural language processing, and I’ll give a first shot at it here.

Ten Thousand Tweets

I started learning the statistical programming language R this past summer, and discovering Hadley Wickham’s data visualization package ggplot2 has been a joy and a revelation. When I think back to how I made all the plots for my astronomy dissertation in the early 2000s (COUGH SUPERMONGO COUGH), I feel a bit in awe of what ggplot2 can do and how easy and, might I even say, delightful it is to use.
