Random Forest using tidymodels
Agenda
Developing the intuition
Random Forest
Bagging
Random forest in tidymodels
It gets better with XGBoost
Comparing and choosing a model
Final fit
Instructions
Listen carefully to the group you will be assigned to
DO NOT look for the correct answer
No Google
No friend
No neighbor
Developing the intuition
INDIVIDUALLY
How many people attended the last Taylor Swift concert at the Pittsburgh stadium during the Eras Tour?
The Condorcet Jury Theorem
If each person is more than 50% correct, then adding more people to the vote increases the probability that the majority is correct.
The theorem suggests that the probability of the majority vote being correct can go up as we add more and more models.
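To make the theorem concrete, here is a minimal simulation (my own sketch, not part of the original slides): each voter is independently correct with probability p, and we track how often the majority vote is correct as the number of voters grows. The function name condorcet_sim and its arguments are made up for illustration.
condorcet_sim <- function(n_voters, p = 0.6, n_trials = 10000) {
  # one row per trial, one column per voter; 1 = a correct vote
  votes <- matrix(rbinom(n_voters * n_trials, size = 1, prob = p),
                  nrow = n_trials)
  mean(rowMeans(votes) > 0.5)  # share of trials where the majority is correct
}

set.seed(1)
sapply(c(1, 11, 101, 1001), condorcet_sim)
# with p > 0.5, the majority's accuracy climbs toward 1 as voters are added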
Random Forest
A Random Forest is just a bunch of decision trees bundled together.
The idea: if we have a "weak" algorithm like a decision tree, we can build many different models with it and average their predictions, and the final result will be much better than any single model.
This is called Ensemble Learning.
Ensemble Predictions
Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset.
Boosting. Building multiple models (typically of the same type), each of which learns to fix the prediction errors of a prior model in the chain.
Stacking. Building multiple models (typically of differing types) and a supervisor model that learns how to best combine the predictions of the primary models.
Bagging
One way to produce multiple models that are different is to train each model using a different training set.
The Bagging (Bootstrap Aggregating) method randomly draws a fixed number of samples from the training set with replacement.
The algorithm randomly samples observations to build each tree, and it also randomly selects which variables to consider when making a new node.
The process (sketched in code below):
Select random samples from the training set.
Construct a decision tree for each bootstrap sample.
Let every tree vote on (or average) a prediction for each observation.
Select the most-voted prediction as the final result.
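A hand-rolled sketch of that process (my own toy example using the rpart and rsample packages on a made-up data frame dat; a real random forest additionally samples mtry candidate variables at each split):
library(rpart)
library(rsample)

set.seed(1)
dat <- data.frame(
  y  = factor(rep(c("yes", "no"), each = 50)),
  x1 = c(rnorm(50, mean = 2), rnorm(50, mean = 0)),
  x2 = c(rnorm(50, mean = 0), rnorm(50, mean = 1))
)

# 1. draw bootstrap samples from the training set
boots <- bootstraps(dat, times = 25)

# 2. construct one decision tree per bootstrap sample
trees <- lapply(boots$splits,
                function(s) rpart(y ~ ., data = analysis(s)))

# 3. every tree votes for each observation
votes <- sapply(trees,
                function(tr) as.character(predict(tr, newdata = dat, type = "class")))

# 4. the most-voted class is the ensemble's final prediction
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged_pred == dat$y)  # training accuracy of the bagged ensemble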
How do we judge which variables matter most across the forest?
The total contribution to node purity
How often a variable is selected as a node across all the trees
How high (close to the root) it shows up across the trees
An example
We will build a random forest model to classify if a road sign is a pedestrian crossing sign or not.
Our features are: size, number of sides, number of colors used, and whether the sign has text or a symbol.
Random Forest with tidymodels
Task
Predict whether an adolescent has consumed alcohol or not based on a set of various risk behaviors.
Data Cleaning
data("riskyBehaviors")

riskyBehaviors_analysis <-
  riskyBehaviors |>
  mutate(UsedAlcohol = case_when(
    AgeFirstAlcohol == 1 ~ 0,
    AgeFirstAlcohol %in% c(2, 3, 4, 5, 6, 7) ~ 1,
    TRUE ~ NA
  )) |>
  mutate(UsedAlcohol = factor(UsedAlcohol)) |>
  drop_na(UsedAlcohol) |>
  select(-c(AgeFirstAlcohol, DaysAlcohol, BingeDrinking,
            LargestNumberOfDrinks, SourceAlcohol))
Splitting the data
set.seed(2023)

alcohol_split <- initial_split(riskyBehaviors_analysis,
                               strata = UsedAlcohol)

alcohol_train <- training(alcohol_split)
alcohol_test <- testing(alcohol_split)

alcohol_split
<Training/Testing/Total>
<9889/3297/13186>
Let's Check Our Work
alcohol_train |>
  tabyl(UsedAlcohol) |>
  adorn_pct_formatting(0) |>
  adorn_totals()
UsedAlcohol n percent
0 4354 44%
1 5535 56%
Total 9889 -
alcohol_test |>
  tabyl(UsedAlcohol) |>
  adorn_pct_formatting(0) |>
  adorn_totals()
UsedAlcohol n percent
0 1452 44%
1 1845 56%
Total 3297 -
Creating the Resampling Object
set.seed(2023)

cv_alcohol <- rsample::vfold_cv(alcohol_train,
                                v = 5,
                                strata = UsedAlcohol)
cv_alcohol
# 5-fold cross-validation using stratification
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [7911/1978]> Fold1
2 <split [7911/1978]> Fold2
3 <split [7911/1978]> Fold3
4 <split [7911/1978]> Fold4
5 <split [7912/1977]> Fold5
The Recipe
alcohol_recipe <-
  recipe(formula = UsedAlcohol ~ ., data = alcohol_train) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
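An optional sanity check (my addition, not in the original slides): prep the recipe on the training data and bake it to see exactly what the model will receive after imputation and dummy coding.
alcohol_recipe |>
  prep() |>
  bake(new_data = NULL) |>   # NULL returns the processed training set
  glimpse()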
The Specification
ranger_spec <-
  rand_forest(
    # the number of predictors to sample at each split
    mtry = tune(),
    # the minimum number of observations needed to keep splitting a node
    min_n = tune(),
    trees = 100) |>
  set_mode("classification") |>
  set_engine("ranger",
             # This is essential for vip()
             importance = "permutation")

ranger_spec
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 100
min_n = tune()
Engine-Specific Arguments:
importance = permutation
Computational engine: ranger
The Workflow
ranger_workflow <-
  workflow() |>
  add_recipe(alcohol_recipe) |>
  add_model(ranger_spec)

ranger_workflow
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_impute_mode()
• step_impute_mean()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 100
min_n = tune()
Engine-Specific Arguments:
importance = permutation
Computational engine: ranger
Tuning
doParallel::registerDoParallel()

set.seed(46257)

ranger_tune <-
  tune_grid(
    ranger_workflow,
    resamples = cv_alcohol,
    # grid = 11 says to choose 11 parameter sets automatically
    grid = 11)

doParallel::stopImplicitCluster()
Collect the Tuning Metrics
collect_metrics(ranger_tune)
# A tibble: 22 × 8
mtry min_n .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 46 28 accuracy binary 0.793 5 0.00237 Preprocessor1_Model01
2 46 28 roc_auc binary 0.859 5 0.00206 Preprocessor1_Model01
3 22 7 accuracy binary 0.789 5 0.00381 Preprocessor1_Model02
4 22 7 roc_auc binary 0.857 5 0.00155 Preprocessor1_Model02
5 18 15 accuracy binary 0.794 5 0.00288 Preprocessor1_Model03
6 18 15 roc_auc binary 0.861 5 0.00148 Preprocessor1_Model03
7 32 10 accuracy binary 0.791 5 0.00266 Preprocessor1_Model04
8 32 10 roc_auc binary 0.856 5 0.00138 Preprocessor1_Model04
9 57 18 accuracy binary 0.794 5 0.00253 Preprocessor1_Model05
10 57 18 roc_auc binary 0.856 5 0.00189 Preprocessor1_Model05
# ℹ 12 more rows
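A quick way to eyeball these results (my addition): the autoplot() method from tune plots the resampled metrics against the tried values of mtry and min_n, which makes it easier to see where roc_auc peaks.
autoplot(ranger_tune)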
Select best hyperparameters
best <- select_best(ranger_tune, metric = "roc_auc")

best
# A tibble: 1 × 3
mtry min_n .config
<int> <int> <chr>
1 2 39 Preprocessor1_Model11
Finalize the Workflow
final_wf <- finalize_workflow(ranger_workflow, best)

final_wf
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_impute_mode()
• step_impute_mean()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = 2
trees = 100
min_n = 39
Engine-Specific Arguments:
importance = permutation
Computational engine: ranger
Fit the model on the training set
alcohol_fit <- fit(final_wf, alcohol_train)

alcohol_fit
Checking predictions on the training set
alcohol_pred <-
  augment(alcohol_fit, alcohol_train) |>
  select(UsedAlcohol, .pred_class, .pred_1, .pred_0)

alcohol_pred
# A tibble: 9,889 × 4
UsedAlcohol .pred_class .pred_1 .pred_0
<fct> <fct> <dbl> <dbl>
1 0 1 0.654 0.346
2 0 0 0.265 0.735
3 0 0 0.215 0.785
4 0 0 0.213 0.787
5 0 1 0.848 0.152
6 0 1 0.661 0.339
7 0 0 0.314 0.686
8 0 0 0.304 0.696
9 0 0 0.412 0.588
10 0 0 0.205 0.795
# ℹ 9,879 more rows
Can it get better?
XGBoost
Short for "Extreme Gradient Boosting."
It belongs to the family of boosting algorithms, which means it combines multiple weaker models.
It works by iteratively building decision trees and then adding them together into a single model.
It focuses on reducing errors by analyzing the residuals (the differences between predicted and actual values) at each iteration and adjusting subsequent trees accordingly.
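A toy illustration of that residual-fitting loop (my own sketch using rpart on simulated data, not the actual xgboost internals): each new shallow tree is fit to the current residuals and added back with a small learning rate.
library(rpart)

set.seed(1)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.3)

pred <- rep(mean(toy$y), nrow(toy))  # start from a constant prediction
learn_rate <- 0.1

for (i in 1:100) {
  toy$resid <- toy$y - pred                       # errors of the current ensemble
  tree <- rpart(resid ~ x, data = toy, maxdepth = 2)
  pred <- pred + learn_rate * predict(tree, toy)  # small corrective step
}

sqrt(mean((toy$y - pred)^2))  # the RMSE shrinks as trees accumulate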
The Specification
xgb_spec <- boost_tree(
  trees = 100,
  tree_depth = tune(), min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(), mtry = tune(),
  learn_rate = tune()
) |>
  set_engine("xgboost") |>
  set_mode("classification")

xgb_spec
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 100
min_n = tune()
tree_depth = tune()
learn_rate = tune()
loss_reduction = tune()
sample_size = tune()
Computational engine: xgboost
The Workflow
xgb_wf <-
  workflow() |>
  add_recipe(alcohol_recipe) |>
  add_model(xgb_spec)

xgb_wf
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_impute_mode()
• step_impute_mean()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 100
min_n = tune()
tree_depth = tune()
learn_rate = tune()
loss_reduction = tune()
sample_size = tune()
Computational engine: xgboost
Tuning
doParallel::registerDoParallel()

set.seed(234)

xgb_res <- tune_grid(
  xgb_wf,
  resamples = cv_alcohol,
  grid = 11,
  control = control_grid(save_pred = TRUE)
)

doParallel::stopImplicitCluster()

xgb_res
Collect the Tuning Metrics
collect_metrics(xgb_res)
# A tibble: 22 × 12
mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
<int> <int> <int> <dbl> <dbl> <dbl> <chr>
1 42 5 11 0.00124 0.000000157 0.349 accuracy
2 42 5 11 0.00124 0.000000157 0.349 roc_auc
3 39 15 9 0.0744 0.00237 0.819 accuracy
4 39 15 9 0.0744 0.00237 0.819 roc_auc
5 20 19 2 0.0374 0.00000612 0.705 accuracy
6 20 19 2 0.0374 0.00000612 0.705 roc_auc
7 31 8 5 0.0190 0.0000403 0.224 accuracy
8 31 8 5 0.0190 0.0000403 0.224 roc_auc
9 3 38 7 0.00414 0.00000000990 0.585 accuracy
10 3 38 7 0.00414 0.00000000990 0.585 roc_auc
# ℹ 12 more rows
# ℹ 5 more variables: .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>,
# .config <chr>
Select best hyperparameters
best_xg <- select_best(xgb_res, metric = "roc_auc")

best_xg
# A tibble: 1 × 7
mtry min_n tree_depth learn_rate loss_reduction sample_size .config
<int> <int> <int> <dbl> <dbl> <dbl> <chr>
1 53 29 5 0.180 0.250 0.489 Preprocessor1_Mo…
Finalize the Workflow
final_wf_xg <- finalize_workflow(xgb_wf, best_xg)

final_wf_xg
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_impute_mode()
• step_impute_mean()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 53
trees = 100
min_n = 29
tree_depth = 5
learn_rate = 0.180162087183753
loss_reduction = 0.249654965980625
sample_size = 0.489060359198431
Computational engine: xgboost
Fit the model on the training set
alcohol_fit_xg <- fit(final_wf_xg, alcohol_train)

alcohol_fit_xg
Checking predictions on the training set
alcohol_pred_xg <-
  augment(alcohol_fit_xg, alcohol_train) |>
  select(UsedAlcohol, .pred_class, .pred_1, .pred_0)

alcohol_pred_xg
# A tibble: 9,889 × 4
UsedAlcohol .pred_class .pred_1 .pred_0
<fct> <fct> <dbl> <dbl>
1 0 1 0.692 0.308
2 0 0 0.176 0.824
3 0 0 0.118 0.882
4 0 0 0.117 0.883
5 0 1 0.937 0.0627
6 0 1 0.697 0.303
7 0 0 0.184 0.816
8 0 0 0.308 0.692
9 0 0 0.285 0.715
10 0 0 0.142 0.858
# ℹ 9,879 more rows
Let's compare the performance
Last fit for the Random Forest
alcohol_last_fit <-
  last_fit(final_wf,
           split = alcohol_split,
           metrics = metric_set(kap, roc_auc, sens, spec))

alcohol_last_fit
Metrics for Random Forest in Testing Data
collect_metrics(alcohol_last_fit)
# A tibble: 4 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 kap binary 0.593 Preprocessor1_Model1
2 sens binary 0.793 Preprocessor1_Model1
3 spec binary 0.803 Preprocessor1_Model1
4 roc_auc binary 0.870 Preprocessor1_Model1
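To put the boosted model on the same footing, the same two steps pointed at final_wf_xg would give XGBoost its own held-out metrics for a head-to-head comparison (a sketch of my own, not run in the original slides; alcohol_last_fit_xg is a name I made up).
alcohol_last_fit_xg <-
  last_fit(final_wf_xg,
           split = alcohol_split,
           metrics = metric_set(kap, roc_auc, sens, spec))

collect_metrics(alcohol_last_fit_xg)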
Predictions on the testing set
predictions_testing <-
  alcohol_last_fit |>
  collect_predictions() |>
  select(-.config, -.row)

predictions_testing
# A tibble: 3,297 × 5
id .pred_class .pred_0 .pred_1 UsedAlcohol
<chr> <fct> <dbl> <dbl> <fct>
1 train/test split 0 0.688 0.312 0
2 train/test split 0 0.529 0.471 1
3 train/test split 1 0.487 0.513 0
4 train/test split 1 0.246 0.754 1
5 train/test split 1 0.0676 0.932 1
6 train/test split 1 0.173 0.827 1
7 train/test split 0 0.652 0.348 1
8 train/test split 0 0.642 0.358 0
9 train/test split 1 0.0782 0.922 1
10 train/test split 1 0.0247 0.975 1
# ℹ 3,287 more rows
Confusion Matrix in the testing set
conf_mat_test <-
  predictions_testing |>
  conf_mat(UsedAlcohol, .pred_class) |>
  autoplot(type = "heatmap")

conf_mat_test
Make sure your metrics are interpretable
collect_metrics(alcohol_last_fit) |>
  select(-.estimator, -.config)
# A tibble: 4 × 2
.metric .estimate
<chr> <dbl>
1 kap 0.593
2 sens 0.793
3 spec 0.803
4 roc_auc 0.870
Make sure your metrics are interpretable
By default, yardstick treats the first factor level (here, 0) as the event, so set event_level = "second" if you want sensitivity and specificity to refer to class 1.
multi_metric <- metric_set(sens, spec, accuracy, kap)

multi_metric(predictions_testing,
             truth = UsedAlcohol,
             estimate = .pred_class,
             event_level = "second")
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sens binary 0.803
2 spec binary 0.793
3 accuracy binary 0.799
4 kap binary 0.593
Compare ROC-AUC in training and testing
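A sketch of that comparison (assuming the alcohol_pred and predictions_testing objects created earlier): .pred_1 holds the probability of the second factor level, so event_level = "second" keeps the AUC oriented toward class 1.
bind_rows(
  training = roc_auc(alcohol_pred, UsedAlcohol, .pred_1, event_level = "second"),
  testing  = roc_auc(predictions_testing, UsedAlcohol, .pred_1, event_level = "second"),
  .id = "set"
)
# a markedly higher training AUC than testing AUC would signal overfitting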
Variable Importance Plot
library(vip)

alcohol_last_fit |>
  extract_fit_parsnip() |>
  vi() |>
  slice_max(Importance, n = 10) |>
  ggplot(aes(Importance, fct_reorder(Variable, Importance))) +
  geom_col() +
  labs(y = NULL) +
  theme(legend.position = "none")
We did it!