Random Forest using tidymodels
Agenda
Developing the intuition
Random Forest
Bagging
Random forest in tidymodels
It gets better with XGBoost
Comparing and choosing a model
Final fit
Instructions
Listen carefully to the group you will be assigned to
DO NOT look for the correct answer
No Google
No friend
No neighbor
Developing the intuition
INDIVIDUALLY
How many people attended the last Taylor Swift concert at the Pittsburgh stadium during the Eras Tour?
The Condorcet Jury Theorem
If each person is more than 50% correct, then adding more people to the vote increases the probability that the majority is correct.
The theorem suggests that the probability of the majority vote being correct can go up as we add more and more models.
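To make the theorem concrete, here is a minimal simulation (my own sketch, not part of the original slides): each voter is independently correct with probability p, and we track how often the majority vote is correct as the number of voters grows. The function name condorcet_sim and its arguments are made up for illustration.
condorcet_sim <- function(n_voters, p = 0.6, n_trials = 10000) {
  # one row per trial, one column per voter; 1 = a correct vote
  votes <- matrix(rbinom(n_voters * n_trials, size = 1, prob = p),
                  nrow = n_trials)
  mean(rowMeans(votes) > 0.5)  # share of trials where the majority is correct
}

set.seed(1)
sapply(c(1, 11, 101, 1001), condorcet_sim)
# with p > 0.5, the majority's accuracy climbs toward 1 as voters are added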
Random Forest
A Random Forest is just a bunch of decision trees bundled together.
The idea: if we have a "weak" algorithm like a decision tree, we can build many different models with it and average their predictions, and the final result will be much better than any single model.
This is called Ensemble Learning.
Ensemble Predictions
Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset.
Boosting. Building multiple models (typically of the same type), each of which learns to fix the prediction errors of a prior model in the chain.
Stacking. Building multiple models (typically of differing types) and a supervisor model that learns how to best combine the predictions of the primary models.
Bagging
One way to produce multiple models that are different is to train each model using a different training set.
The Bagging (Bootstrap Aggregating) method randomly draws a fixed number of samples from the training set with replacement.
The algorithm randomly samples observations to build each tree, and it also randomly selects which variables to consider when making a new node.
The process (sketched in code below):
Select random samples from the training set.
Construct a decision tree for each bootstrap sample.
Let every tree vote on (or average) a prediction for each observation.
Select the most-voted prediction as the final result.
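A hand-rolled sketch of that process (my own toy example using the rpart and rsample packages on a made-up data frame dat; a real random forest additionally samples mtry candidate variables at each split):
library(rpart)
library(rsample)

set.seed(1)
dat <- data.frame(
  y  = factor(rep(c("yes", "no"), each = 50)),
  x1 = c(rnorm(50, mean = 2), rnorm(50, mean = 0)),
  x2 = c(rnorm(50, mean = 0), rnorm(50, mean = 1))
)

# 1. draw bootstrap samples from the training set
boots <- bootstraps(dat, times = 25)

# 2. construct one decision tree per bootstrap sample
trees <- lapply(boots$splits,
                function(s) rpart(y ~ ., data = analysis(s)))

# 3. every tree votes for each observation
votes <- sapply(trees,
                function(tr) as.character(predict(tr, newdata = dat, type = "class")))

# 4. the most-voted class is the ensemble's final prediction
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged_pred == dat$y)  # training accuracy of the bagged ensemble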
How do we judge which variables matter most across the forest?
The total contribution to node purity
How often a variable is selected as a node across all the trees
How high (close to the root) it shows up across the trees
An example
We will build a random forest model to classify if a road sign is a pedestrian crossing sign or not.
Our features are: size, number of sides, number of colors used, and whether the sign has text or a symbol.
Random Forest with tidymodels
Task
Predict whether an adolescent has consumed alcohol or not based on a set of various risk behaviors.
Data Cleaning
data("riskyBehaviors")

riskyBehaviors_analysis <-
  riskyBehaviors |>
  mutate(UsedAlcohol = case_when(
    AgeFirstAlcohol == 1 ~ 0,
    AgeFirstAlcohol %in% c(2, 3, 4, 5, 6, 7) ~ 1,
    TRUE ~ NA
  )) |>
  mutate(UsedAlcohol = factor(UsedAlcohol)) |>
  drop_na(UsedAlcohol) |>
  select(-c(AgeFirstAlcohol, DaysAlcohol, BingeDrinking,
            LargestNumberOfDrinks, SourceAlcohol))
Splitting the data
set.seed(2023)

alcohol_split <- initial_split(riskyBehaviors_analysis,
                               strata = UsedAlcohol)

alcohol_train <- training(alcohol_split)
alcohol_test <- testing(alcohol_split)

alcohol_split
<Training/Testing/Total>
<9889/3297/13186>
Let's Check Our Work
alcohol_train |>
  tabyl(UsedAlcohol) |>
  adorn_pct_formatting(0) |>
  adorn_totals()
UsedAlcohol n percent
0 4354 44%
1 5535 56%
Total 9889 -
alcohol_test |>
  tabyl(UsedAlcohol) |>
  adorn_pct_formatting(0) |>
  adorn_totals()
UsedAlcohol n percent
0 1452 44%
1 1845 56%
Total 3297 -
Creating the Resampling Object
set.seed(2023)

cv_alcohol <- rsample::vfold_cv(alcohol_train,
                                v = 5,
                                strata = UsedAlcohol)
cv_alcohol
# 5-fold cross-validation using stratification
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [7911/1978]> Fold1
2 <split [7911/1978]> Fold2
3 <split [7911/1978]> Fold3
4 <split [7911/1978]> Fold4
5 <split [7912/1977]> Fold5
The Recipe
alcohol_recipe <-
  recipe(formula = UsedAlcohol ~ ., data = alcohol_train) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())
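An optional sanity check (my addition, not in the original slides): prep the recipe on the training data and bake it to see exactly what the model will receive after imputation and dummy coding.
alcohol_recipe |>
  prep() |>
  bake(new_data = NULL) |>   # NULL returns the processed training set
  glimpse()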
The Specification
ranger_spec <-
  rand_forest(
    # the number of predictors to sample at each split
    mtry = tune(),
    # the minimum number of observations needed to keep splitting a node
    min_n = tune(),
    trees = 100) |>
  set_mode("classification") |>
  set_engine("ranger",
             # This is essential for vip()
             importance = "permutation")

ranger_spec
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 100
min_n = tune()
Engine-Specific Arguments:
importance = permutation
Computational engine: ranger
The Workflow
ranger_workflow <-
  workflow() |>
  add_recipe(alcohol_recipe) |>
  add_model(ranger_spec)

ranger_workflow
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_impute_mode()
• step_impute_mean()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 100
min_n = tune()
Engine-Specific Arguments:
importance = permutation
Computational engine: ranger
Tuning
doParallel::registerDoParallel()

set.seed(46257)

ranger_tune <-
  tune_grid(
    ranger_workflow,
    resamples = cv_alcohol,
    # grid = 11 says to choose 11 parameter sets automatically
    grid = 11)

doParallel::stopImplicitCluster()
Collect the Tuning Metrics
collect_metrics(ranger_tune)
# A tibble: 22 × 8
mtry min_n .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 46 28 accuracy binary 0.793 5 0.00237 Preprocessor1_Model01
2 46 28 roc_auc binary 0.859 5 0.00206 Preprocessor1_Model01
3 22 7 accuracy binary 0.789 5 0.00381 Preprocessor1_Model02
4 22 7 roc_auc binary 0.857 5 0.00155 Preprocessor1_Model02
5 18 15 accuracy binary 0.794 5 0.00288 Preprocessor1_Model03
6 18 15 roc_auc binary 0.861 5 0.00148 Preprocessor1_Model03
7 32 10 accuracy binary 0.791 5 0.00266 Preprocessor1_Model04
8 32 10 roc_auc binary 0.856 5 0.00138 Preprocessor1_Model04
9 57 18 accuracy binary 0.794 5 0.00253 Preprocessor1_Model05
10 57 18 roc_auc binary 0.856 5 0.00189 Preprocessor1_Model05
# ℹ 12 more rows
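A quick way to eyeball these results (my addition): the autoplot() method from tune plots the resampled metrics against the tried values of mtry and min_n, which makes it easier to see where roc_auc peaks.
autoplot(ranger_tune)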
Select best hyperparameters
best <- select_best(ranger_tune, metric = "roc_auc")

best
# A tibble: 1 × 3
mtry min_n .config
<int> <int> <chr>
1 2 39 Preprocessor1_Model11
Finalize the Workflow
final_wf <- finalize_workflow(ranger_workflow, best)

final_wf
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: rand_forest()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_impute_mode()
• step_impute_mean()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = 2
trees = 100
min_n = 39
Engine-Specific Arguments:
importance = permutation
Computational engine: ranger
Fit the model on the training set
alcohol_fit <- fit(final_wf, alcohol_train)

alcohol_fit
Checking predictions on the training set
alcohol_pred <-
  augment(alcohol_fit, alcohol_train) |>
  select(UsedAlcohol, .pred_class, .pred_1, .pred_0)

alcohol_pred
# A tibble: 9,889 × 4
UsedAlcohol .pred_class .pred_1 .pred_0
<fct> <fct> <dbl> <dbl>
1 0 1 0.654 0.346
2 0 0 0.265 0.735
3 0 0 0.215 0.785
4 0 0 0.213 0.787
5 0 1 0.848 0.152
6 0 1 0.661 0.339
7 0 0 0.314 0.686
8 0 0 0.304 0.696
9 0 0 0.412 0.588
10 0 0 0.205 0.795
# ℹ 9,879 more rows
Can it get better?
XGBoost
Short for "Extreme Gradient Boosting."
It belongs to the family of boosting algorithms, which means it combines multiple weaker models.
It works by iteratively building decision trees and then adding them together into a single model.
It focuses on reducing errors by analyzing the residuals (the differences between predicted and actual values) at each iteration and adjusting subsequent trees accordingly.
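A toy illustration of that residual-fitting loop (my own sketch using rpart on simulated data, not the actual xgboost internals): each new shallow tree is fit to the current residuals and added back with a small learning rate.
library(rpart)

set.seed(1)
toy <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.3)

pred <- rep(mean(toy$y), nrow(toy))  # start from a constant prediction
learn_rate <- 0.1

for (i in 1:100) {
  toy$resid <- toy$y - pred                       # errors of the current ensemble
  tree <- rpart(resid ~ x, data = toy, maxdepth = 2)
  pred <- pred + learn_rate * predict(tree, toy)  # small corrective step
}

sqrt(mean((toy$y - pred)^2))  # the RMSE shrinks as trees accumulate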
The Specification
xgb_spec <- boost_tree(
  trees = 100,
  tree_depth = tune(), min_n = tune(),
  loss_reduction = tune(),
  sample_size = tune(), mtry = tune(),
  learn_rate = tune()
) |>
  set_engine("xgboost") |>
  set_mode("classification")

xgb_spec
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 100
min_n = tune()
tree_depth = tune()
learn_rate = tune()
loss_reduction = tune()
sample_size = tune()
Computational engine: xgboost
The Workflow
xgb_wf <-
  workflow() |>
  add_recipe(alcohol_recipe) |>
  add_model(xgb_spec)

xgb_wf
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_impute_mode()
• step_impute_mean()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 100
min_n = tune()
tree_depth = tune()
learn_rate = tune()
loss_reduction = tune()
sample_size = tune()
Computational engine: xgboost
Tuning
doParallel::registerDoParallel()

set.seed(234)

xgb_res <- tune_grid(
  xgb_wf,
  resamples = cv_alcohol,
  grid = 11,
  control = control_grid(save_pred = TRUE)
)

doParallel::stopImplicitCluster()

xgb_res
Collect the Tuning Metrics
collect_metrics(xgb_res)
# A tibble: 22 × 12
mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
<int> <int> <int> <dbl> <dbl> <dbl> <chr>
1 42 5 11 0.00124 0.000000157 0.349 accuracy
2 42 5 11 0.00124 0.000000157 0.349 roc_auc
3 39 15 9 0.0744 0.00237 0.819 accuracy
4 39 15 9 0.0744 0.00237 0.819 roc_auc
5 20 19 2 0.0374 0.00000612 0.705 accuracy
6 20 19 2 0.0374 0.00000612 0.705 roc_auc
7 31 8 5 0.0190 0.0000403 0.224 accuracy
8 31 8 5 0.0190 0.0000403 0.224 roc_auc
9 3 38 7 0.00414 0.00000000990 0.585 accuracy
10 3 38 7 0.00414 0.00000000990 0.585 roc_auc
# ℹ 12 more rows
# ℹ 5 more variables: .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>,
# .config <chr>
Select best hyperparameters
best_xg <- select_best(xgb_res, metric = "roc_auc")

best_xg
# A tibble: 1 × 7
mtry min_n tree_depth learn_rate loss_reduction sample_size .config
<int> <int> <int> <dbl> <dbl> <dbl> <chr>
1 53 29 5 0.180 0.250 0.489 Preprocessor1_Mo…
Finalize the Workflow
final_wf_xg <- finalize_workflow(xgb_wf, best_xg)

final_wf_xg
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: boost_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps
• step_impute_mode()
• step_impute_mean()
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 53
trees = 100
min_n = 29
tree_depth = 5
learn_rate = 0.180162087183753
loss_reduction = 0.249654965980625
sample_size = 0.489060359198431
Computational engine: xgboost
Fit the model on the training set
alcohol_fit_xg <- fit(final_wf_xg, alcohol_train)

alcohol_fit_xg
Checking predictions on the training set
alcohol_pred_xg <-
  augment(alcohol_fit_xg, alcohol_train) |>
  select(UsedAlcohol, .pred_class, .pred_1, .pred_0)

alcohol_pred_xg
# A tibble: 9,889 × 4
UsedAlcohol .pred_class .pred_1 .pred_0
<fct> <fct> <dbl> <dbl>
1 0 1 0.692 0.308
2 0 0 0.176 0.824
3 0 0 0.118 0.882
4 0 0 0.117 0.883
5 0 1 0.937 0.0627
6 0 1 0.697 0.303
7 0 0 0.184 0.816
8 0 0 0.308 0.692
9 0 0 0.285 0.715
10 0 0 0.142 0.858
# ℹ 9,879 more rows
Let's compare the performance
Last fit for the Random Forest
alcohol_last_fit <-
  last_fit(final_wf,
           split = alcohol_split,
           metrics = metric_set(kap, roc_auc, sens, spec))

alcohol_last_fit
Metrics for Random Forest in Testing Data
collect_metrics(alcohol_last_fit)
# A tibble: 4 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 kap binary 0.593 Preprocessor1_Model1
2 sens binary 0.793 Preprocessor1_Model1
3 spec binary 0.803 Preprocessor1_Model1
4 roc_auc binary 0.870 Preprocessor1_Model1
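To put the boosted model on the same footing, the same two steps pointed at final_wf_xg would give XGBoost its own held-out metrics for a head-to-head comparison (a sketch of my own, not run in the original slides; alcohol_last_fit_xg is a name I made up).
alcohol_last_fit_xg <-
  last_fit(final_wf_xg,
           split = alcohol_split,
           metrics = metric_set(kap, roc_auc, sens, spec))

collect_metrics(alcohol_last_fit_xg)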
Predictions on the testing set
predictions_testing <-
  alcohol_last_fit |>
  collect_predictions() |>
  select(-.config, -.row)

predictions_testing
# A tibble: 3,297 × 5
id .pred_class .pred_0 .pred_1 UsedAlcohol
<chr> <fct> <dbl> <dbl> <fct>
1 train/test split 0 0.688 0.312 0
2 train/test split 0 0.529 0.471 1
3 train/test split 1 0.487 0.513 0
4 train/test split 1 0.246 0.754 1
5 train/test split 1 0.0676 0.932 1
6 train/test split 1 0.173 0.827 1
7 train/test split 0 0.652 0.348 1
8 train/test split 0 0.642 0.358 0
9 train/test split 1 0.0782 0.922 1
10 train/test split 1 0.0247 0.975 1
# ℹ 3,287 more rows
Confusion Matrix in the testing set
conf_mat_test <-
  predictions_testing |>
  conf_mat(UsedAlcohol, .pred_class) |>
  autoplot(type = "heatmap")

conf_mat_test
Make sure your metrics are interpretable
collect_metrics(alcohol_last_fit) |>
  select(-.estimator, -.config)
# A tibble: 4 × 2
.metric .estimate
<chr> <dbl>
1 kap 0.593
2 sens 0.793
3 spec 0.803
4 roc_auc 0.870
Make sure your metrics are interpretable
By default, yardstick treats the first factor level (here, 0) as the event, so set event_level = "second" if you want sensitivity and specificity to refer to class 1.
multi_metric <- metric_set(sens, spec, accuracy, kap)

multi_metric(predictions_testing,
             truth = UsedAlcohol,
             estimate = .pred_class,
             event_level = "second")
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sens binary 0.803
2 spec binary 0.793
3 accuracy binary 0.799
4 kap binary 0.593
Compare ROC-AUC in training and testing
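A sketch of that comparison (assuming the alcohol_pred and predictions_testing objects created earlier): .pred_1 holds the probability of the second factor level, so event_level = "second" keeps the AUC oriented toward class 1.
bind_rows(
  training = roc_auc(alcohol_pred, UsedAlcohol, .pred_1, event_level = "second"),
  testing  = roc_auc(predictions_testing, UsedAlcohol, .pred_1, event_level = "second"),
  .id = "set"
)
# a markedly higher training AUC than testing AUC would signal overfitting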
Variable Importance Plot
library(vip)

alcohol_last_fit |>
  extract_fit_parsnip() |>
  vi() |>
  slice_max(Importance, n = 10) |>
  ggplot(aes(Importance, fct_reorder(Variable, Importance))) +
  geom_col() +
  labs(y = NULL) +
  theme(legend.position = "none")
We did it!