Introduction to machine learning with {tidymodels}

R/Pharma 2023

Dr Nicola Rennie

Welcome!

What to expect during this workshop

The workshop will run for 2 hours.

  • Combines slides, live coding examples, quiz questions, and exercises for you to participate in.

  • Ask questions in the chat throughout!

I hope you end up with more questions than answers after this workshop!


Stranger Things questions gif

Source: giphy.com

Workshop resources

Data

We’ll use two data sets in this workshop:

  • Heart failure data:
heart_failure <- readr::read_csv("data/heart_failure.csv")

  • Exercises data: Physical Exercise in Patients with Subacute Stroke (PHYS-STROKE): safety analyses of six-month follow-up of a randomized clinical trial. doi.org/10.5281/zenodo.3899830
exercises <- readr::read_csv("data/exercises.csv")

Quiz!

Have you done machine learning in R before?

  • No

  • Yes, with {tidymodels}

  • Yes, with {caret}

  • Yes, with something else

Getting started with {tidymodels}

What is {tidymodels}?

  • A collection of R packages for statistical modelling and machine learning.

  • Follows the {tidyverse} principles.

  • install.packages("tidymodels")

tidymodels R package hex sticker logo

There are some core {tidymodels} packages…


… and plenty of extensions!
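
To see which packages are bundled together, one option is to ask {tidymodels} itself. This is a minimal sketch, assuming {tidymodels} is already installed; the variable name core_pkgs is only illustrative.

```r
# Attach the core {tidymodels} packages in one call
library(tidymodels)

# List the packages that {tidymodels} depends on and bundles together
core_pkgs <- tidymodels_packages()
print(core_pkgs)
```

The list includes the modelling packages used later in this workshop, such as parsnip (models), recipes (pre-processing), and yardstick (metrics).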

What is machine learning?

Machine learning diagram

  • Learning from data

  • Mostly used to make predictions or classifications.

Types of machine learning

Supervised learning: requires labelled input data

  • Classification

  • Regression-based models

Unsupervised learning: does not require labelled input data

  • Clustering

  • Association rules

Other types of machine learning include semi-supervised learning and reinforcement learning.

Before we start fitting models…

Training and testing data

Training and testing diagram
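
The split into training and testing data can be sketched as follows. The data here is simulated as a stand-in (the columns age, marker, and outcome are hypothetical and are not the workshop data), and the 80/20 proportion and 5 folds are arbitrary choices.

```r
library(tidymodels)

set.seed(123)
# Simulated stand-in data (hypothetical columns, not the workshop data)
sim_data <- tibble(
  age = rnorm(200, mean = 60, sd = 10),
  marker = rnorm(200),
  outcome = factor(sample(c("yes", "no"), 200, replace = TRUE))
)

# Hold back roughly 20% of rows for testing, keeping outcome proportions similar
sim_split <- initial_split(sim_data, prop = 0.8, strata = outcome)
sim_train <- training(sim_split)
sim_test <- testing(sim_split)

# Split the training data again into 5 cross-validation folds
sim_folds <- vfold_cv(sim_train, v = 5)

nrow(sim_train)
nrow(sim_test)
```

The testing data is left untouched until the very end; the cross-validation folds are what we use to compare candidate models along the way.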

Hyperparameter tuning

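We can’t always learn every parameter from the data. Some values, called hyperparameters, have to be set before the model is fitted. Tuning means trying several candidate values and keeping the one that performs best, for example across cross-validation folds.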
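
In {tidymodels}, a hyperparameter is marked for tuning with tune(), and candidate values come from a grid. A minimal sketch, using the LASSO penalty (covered later in this workshop) as the example; the choice of 5 levels is arbitrary.

```r
library(tidymodels)

# penalty = tune() is a placeholder: the value is chosen later by tuning
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

# Candidate values to try: 5 values, evenly spaced on the log10 scale
penalty_grid <- grid_regular(penalty(), levels = 5)
penalty_grid
```

Each row of the grid is one candidate value that would be evaluated on the cross-validation folds.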

Workflows and recipes

recipes package hex sticker

Recipe

A series of preprocessing steps performed on data before you fit a model.


Workflow

An object that can combine your pre-processing, modelling, and post-processing steps. E.g. combine a recipe with a model.
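
As a rough illustration of how a recipe and a workflow fit together, here is a sketch on simulated data. The columns (age, group, outcome) and the two pre-processing steps are assumptions chosen for illustration, not the steps used in the workshop examples.

```r
library(tidymodels)

set.seed(123)
# Simulated stand-in data (hypothetical columns)
sim_data <- tibble(
  age = rnorm(100, mean = 60, sd = 10),
  group = factor(sample(c("a", "b", "c"), 100, replace = TRUE)),
  outcome = factor(sample(c("yes", "no"), 100, replace = TRUE))
)

# Recipe: declare the outcome and predictors, then add pre-processing steps
sim_recipe <- recipe(outcome ~ ., data = sim_data) |>
  # convert factor predictors into 0/1 indicator columns
  step_dummy(all_nominal_predictors()) |>
  # centre and scale numeric predictors (including the new indicator columns)
  step_normalize(all_numeric_predictors())

# Workflow: bundle the recipe with a model specification
sim_wf <- workflow() |>
  add_recipe(sim_recipe) |>
  add_model(logistic_reg())

# Fitting the workflow applies the recipe, then fits the model
sim_fit <- fit(sim_wf, data = sim_data)
sim_preds <- predict(sim_fit, new_data = sim_data)
```

Because the recipe travels with the model inside the workflow, the same pre-processing is applied automatically when predicting on new data.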

Pre-processing in {tidymodels}

Live demo!



See examples/example_01.R for full code.

Exercise 1

Open exercises/exercise_01.R for prompts.

  • Load the {tidyverse} and {tidymodels} packages

  • Read in the exercises.csv data

  • View and explore the data

  • Perform the initial split (choose your own proportion!)

  • Create some cross-validation folds

  • Build a recipe and workflow

Time: 10 minutes

See exercise_solutions/exercise_solutions_01.R for full code.

LASSO regression

Linear and logistic regression models

Let’s go back a little bit first…

Linear regression

lm(y ~ x, data = model_data)

Linear regression plot

Logistic regression

glm(y ~ x, family = "binomial", data = model_data)

Logistic regression plot
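
To try the two calls above, here is a small runnable sketch. model_data is simulated, and the column names y_cont and y_bin are hypothetical stand-ins for a continuous and a binary outcome.

```r
set.seed(123)
# Simulated stand-in for model_data (hypothetical)
model_data <- data.frame(x = rnorm(100))
model_data$y_cont <- 2 + 3 * model_data$x + rnorm(100)
model_data$y_bin <- rbinom(100, size = 1, prob = plogis(0.5 + model_data$x))

# Linear regression: continuous outcome
lin_fit <- lm(y_cont ~ x, data = model_data)

# Logistic regression: binary outcome
log_fit <- glm(y_bin ~ x, family = "binomial", data = model_data)

# Intercept and slope of the linear model
coef(lin_fit)

# Predicted probabilities from the logistic model lie between 0 and 1
log_probs <- predict(log_fit, type = "response")
range(log_probs)
```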

Quiz!

How do you choose which explanatory variables to include?

  • Using background knowledge

  • p-values and correlations between variables

  • Stepwise procedures (forward/backward/bi-directional)

  • Something else

LASSO regression

Standard regression: minimise distance between predicted and observed values


Least Absolute Shrinkage and Selection Operator (LASSO): minimise (distance between predicted and observed values + \(\lambda\) \(\times\) sum of the absolute values of the coefficients)


See also: ridge regression
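
Written out for linear regression, with the usual (assumed) notation of \(n\) observations, \(p\) explanatory variables, and coefficients \(\beta_j\), the LASSO estimate is:

\[
\hat{\beta}^{\text{LASSO}} = \arg\min_{\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}
\]

The penalty uses the absolute values of the coefficients and leaves out the intercept \(\beta_0\). Ridge regression replaces \(|\beta_j|\) with \(\beta_j^2\). For logistic regression, the squared-error term is replaced by the negative log-likelihood.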

Hyperparameters for LASSO regression

\(\lambda\) (penalty) takes a value between 0 and \(\infty\).

  • Higher value: stronger penalty, so more coefficients are shrunk to exactly zero and fewer variables stay in the model

  • Lower value: closer to standard regression models. (\(\lambda = 0\) ~ standard regression model)
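
The effect of the penalty can be seen by counting non-zero coefficients at a small and a large value of \(\lambda\). This is a sketch on simulated data, assuming the {glmnet} package is installed; count_nonzero() is a hypothetical helper written for this illustration, and it relies on tidy() returning the coefficients at the penalty given in the model specification.

```r
library(tidymodels)

set.seed(123)
n <- 200
# Simulated data: only x1 is truly related to the outcome
sim <- tibble(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
sim$y <- factor(ifelse(sim$x1 + rnorm(n, sd = 0.5) > 0, "yes", "no"))

# Hypothetical helper: fit a LASSO logistic regression at a fixed penalty
# and count how many predictors keep a non-zero coefficient
count_nonzero <- function(lambda) {
  logistic_reg(penalty = lambda, mixture = 1) |>
    set_engine("glmnet") |>
    fit(y ~ ., data = sim) |>
    tidy() |>
    filter(term != "(Intercept)", estimate != 0) |>
    nrow()
}

n_low <- count_nonzero(0.001)  # weak penalty
n_high <- count_nonzero(0.5)   # strong penalty
c(low_penalty = n_low, high_penalty = n_high)
```

Setting mixture = 1 requests a pure LASSO penalty; mixture = 0 would give ridge regression instead.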

Model evaluation

(Binary) Classification Metrics

  • Accuracy: proportion of the data that are predicted correctly.

  • ROC AUC: area under the ROC (receiver operating characteristic) curve.

  • Kappa: similar to accuracy but normalised by the accuracy expected by chance alone.

See yardstick.tidymodels.org/articles/metric-types.html.
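
These three metrics can be computed with {yardstick} (loaded as part of {tidymodels}). The six predictions below are made up purely for illustration; the first factor level ("yes") is treated as the event of interest.

```r
library(tidymodels)

# Made-up predictions for six patients (illustrative only)
preds <- tibble(
  truth = factor(c("yes", "yes", "yes", "no", "no", "no"),
                 levels = c("yes", "no")),
  .pred_class = factor(c("yes", "yes", "no", "no", "no", "yes"),
                       levels = c("yes", "no")),
  .pred_yes = c(0.9, 0.8, 0.4, 0.3, 0.2, 0.6)
)

# Accuracy and kappa compare predicted classes with the truth
acc <- accuracy(preds, truth = truth, estimate = .pred_class)
kappa <- kap(preds, truth = truth, estimate = .pred_class)

# ROC AUC uses the predicted probability of the event instead
auc <- roc_auc(preds, truth = truth, .pred_yes)

bind_rows(acc, kappa, auc)
```

Here 4 of the 6 classes are predicted correctly, so accuracy is 4/6, while kappa is lower because half of that agreement would be expected by chance.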

LASSO logistic regression in {tidymodels}

Live demo!



See examples/example_02.R for full code.
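
The overall shape of the demo (specify, tune, finalise, evaluate) can be sketched as below. This is not the code from examples/example_02.R: the data is simulated, the column names are hypothetical, and the fold count and grid size are arbitrary. It assumes the {glmnet} package is installed.

```r
library(tidymodels)

set.seed(2023)
n <- 300
# Simulated stand-in data (hypothetical columns)
sim <- tibble(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$outcome <- factor(
  ifelse(sim$x1 - sim$x2 + rnorm(n) > 0, "yes", "no"),
  levels = c("yes", "no")
)

# Initial split and cross-validation folds
sim_split <- initial_split(sim, prop = 0.8, strata = outcome)
sim_folds <- vfold_cv(training(sim_split), v = 5)

# Model specification with the penalty left to be tuned
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) |>
  set_engine("glmnet")

# Recipe and workflow
lasso_recipe <- recipe(outcome ~ ., data = training(sim_split)) |>
  step_normalize(all_numeric_predictors())
lasso_wf <- workflow() |>
  add_recipe(lasso_recipe) |>
  add_model(lasso_spec)

# Try 10 candidate penalties on the folds, scored by ROC AUC
lasso_tuned <- tune_grid(
  lasso_wf,
  resamples = sim_folds,
  grid = grid_regular(penalty(), levels = 10),
  metrics = metric_set(roc_auc)
)

# Keep the best penalty, refit on the training data, evaluate on the test data
best_penalty <- select_best(lasso_tuned, metric = "roc_auc")
final_fit <- lasso_wf |>
  finalize_workflow(best_penalty) |>
  last_fit(sim_split)

final_metrics <- collect_metrics(final_fit)
final_metrics
```

last_fit() fits on the training portion of the split and computes metrics on the testing portion, so the test data is only used once.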

Exercise 2

Open exercises/exercise_02.R for prompts. You can also use examples/example_02.R as a starting point.

  • Specify the model using logistic_reg().

  • Tune the hyperparameter.

  • Choose the best value and fit the final model.

  • Evaluate the model performance.

Time: 10 minutes

See exercise_solutions/exercise_solutions_02.R for full code.

Quiz!

What was the ROC AUC of your LASSO model?

  • Less than 70%

  • 70-80%

  • 80-90%

  • 90-100%

Random Forests

Decision trees

A tree-like model of decisions and their possible consequences.

Decision tree about walking to work and the weather

What are Random Forests?

  • An ensemble method

  • Combines many decision trees.

  • Can be used for classification or regression problems.

  • For classification tasks, the output of the random forest is the class selected by most trees.
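
The majority vote in the last point can be illustrated in a few lines of base R. The five votes are made up; a real random forest would have hundreds of trees.

```r
# Made-up class predictions from five trees for one patient
tree_votes <- c("yes", "no", "yes", "yes", "no")

# Count the votes for each class
vote_counts <- table(tree_votes)
vote_counts

# The forest predicts the class with the most votes
forest_prediction <- names(which.max(vote_counts))
forest_prediction
```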

Hyperparameters for random forests

trees: number of trees in the ensemble.


mtry: number of predictors that will be randomly sampled at each split when creating the tree models.


min_n: minimum number of data points in a node that are required for the node to be split further.
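
In {tidymodels}, these three hyperparameters are arguments to rand_forest(). The sketch below fixes trees and marks the other two for tuning; the "ranger" engine, the value of 500 trees, and the mtry range of 1 to 3 are assumptions for illustration (the upper limit for mtry depends on how many predictors your data has).

```r
library(tidymodels)

# Random forest specification: trees fixed, mtry and min_n to be tuned
rf_spec <- rand_forest(trees = 500, mtry = tune(), min_n = tune()) |>
  set_engine("ranger") |>
  set_mode("classification")

# Grid of candidate values: 3 levels of each, so 9 combinations
rf_grid <- grid_regular(
  mtry(range = c(1, 3)),
  min_n(),
  levels = 3
)
rf_grid
```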

Random Forests in {tidymodels}

Live demo!



See examples/example_03.R for full code.

Exercise 3

Open exercises/exercise_03.R for prompts. You can also use examples/example_03.R as a starting point.

  • Specify a random forest model using rand_forest()

  • Tune the hyperparameters using the cross-validation folds.

  • Fit the final model and evaluate it.

Time: 10 minutes

See exercise_solutions/exercise_solutions_03.R for full code.

Support Vector Machines (SVM)

What are Support Vector Machines?

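Support Vector Machines (SVMs) draw a decision boundary that best separates two groups, by making the margin between the boundary and the closest points of each group as wide as possible.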

Scatter plot with two groups

Scatter plot with two groups with dividing line

Types of SVM

There are different types of kernel functions, including:

Linear

svm_linear()

linear kernel plot

Polynomial

svm_poly()

polynomial kernel plot

Radial Basis Functions

svm_rbf()

RBF kernel plot

Hyperparameters in SVMs

Cost

  • Higher value: emphasises fitting the data

  • Lower value: prioritises avoiding overfitting

Gamma (shape and smoothness of decision boundary)

  • Higher value: more flexible boundaries

  • Lower value: simpler boundaries

There may be other hyperparameters, depending on the choice of kernel.
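
As a sketch of how these hyperparameters appear in {tidymodels}: cost is an argument to all three svm_* functions, and to the best of my knowledge the gamma-like parameter of the RBF kernel is called rbf_sigma in svm_rbf(). The "kernlab" engine and the grid size below are assumptions for illustration.

```r
library(tidymodels)

# Linear kernel: cost is the main hyperparameter
svm_linear_spec <- svm_linear(cost = tune()) |>
  set_engine("kernlab") |>
  set_mode("classification")

# Polynomial kernel: also has a degree hyperparameter
svm_poly_spec <- svm_poly(cost = tune(), degree = tune()) |>
  set_engine("kernlab") |>
  set_mode("classification")

# RBF kernel: rbf_sigma controls how flexible the boundary is
svm_rbf_spec <- svm_rbf(cost = tune(), rbf_sigma = tune()) |>
  set_engine("kernlab") |>
  set_mode("classification")

# Candidate values for the RBF model: 4 levels of each, so 16 combinations
svm_grid <- grid_regular(cost(), rbf_sigma(), levels = 4)
svm_grid
```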

Support Vector Machines in {tidymodels}

Live demo!



See examples/example_04.R for full code.

Exercise 4

Open exercises/exercise_04.R for prompts. You can also use examples/example_04.R as a starting point.

  • Specify a support vector machine using svm_rbf() (or one of the other svm_* functions if you’re feeling confident!)

  • Tune the cost() hyperparameter using the cross-validation folds.

  • Fit the final model and evaluate it.

  • Look at some other evaluation metrics.

Time: 10 minutes

See exercise_solutions/exercise_solutions_04.R for full code.

Quiz!

Which of the three models performed best on the exercises data?

  • LASSO

  • Random Forest

  • SVM

Additional Information

Learning more about machine learning

There are many machine learning topics and {tidymodels} functions we haven’t covered today. As well as learning about the technical details of models and how to write code, it’s important to learn about:

  • Ethics

  • Bias and discrimination

  • Explainability and validation

Additional resources

Workshop resources