tidymodels
• Developing the intuition
• Random Forest
• Bagging
• Random forest in tidymodels
• It gets better with XGBoost
• Comparing and choosing a model
• Final fit
Listen carefully for the group you will be assigned to
DO NOT look for the correct answer
INDIVIDUALLY
How many people attended the last Taylor Swift Eras Tour concert at the Pittsburgh stadium?
• If each person is more than 50% likely to be correct, then adding more people to the vote increases the probability that the majority is correct (this is Condorcet's jury theorem).
• The theorem suggests that the probability of the majority vote being correct can go up as we add more and more models; the sketch after this list illustrates the effect.
• A Random Forest is just a bunch of decision trees bundled together.
• The idea: if we have a "weak" algorithm such as a decision tree, we can build many different models with it and average their predictions, and the combined result will be much better than any single model.
• This is called Ensemble Learning.
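A minimal sketch of that majority-vote intuition in base R; the probability of 0.6 and the voter counts are made-up numbers for illustration, assuming each voter (or model) is correct independently.

# Probability that a strict majority of n independent voters is correct,
# when each voter is correct with probability p
majority_correct <- function(n, p) {
  1 - pbinom(floor(n / 2), size = n, prob = p)
}

sapply(c(1, 11, 101, 1001), majority_correct, p = 0.6)
# climbs from 0.60 toward 1 as voters are added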
• Bagging: building multiple models (typically of the same type) from different subsamples of the training dataset.
• Boosting: building multiple models (typically of the same type), each of which learns to fix the prediction errors of a prior model in the chain.
• Stacking: building multiple models (typically of differing types) plus a supervisor model that learns how best to combine the predictions of the primary models.
• One way to produce multiple models that differ is to train each model on a different training set.
• The Bagging (Bootstrap Aggregating) method randomly draws a fixed number of samples from the training set with replacement.
• The random forest algorithm randomly samples observations to build each tree, and it also randomly selects which variables to consider when making a new node; a small sketch of the bootstrap draw follows.
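A tiny sketch of the resampling mechanism behind bagging; the ten row ids are made up, but the mechanism (sampling with replacement) is the real one.

set.seed(2023)
train_ids <- 1:10

# One bootstrap resample: same size as the data, drawn with replacement,
# so some rows repeat and others are left out ("out-of-bag")
one_bootstrap <- sample(train_ids, size = length(train_ids), replace = TRUE)
one_bootstrap
setdiff(train_ids, one_bootstrap)   # rows this tree never sees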
• Select random samples from the given training set.
• Construct a decision tree for each sample.
• Each tree votes; the votes are tallied (for regression, the predictions are averaged).
• The most-voted prediction is selected as the final result.
We will build a random forest model to classify whether a road sign is a pedestrian crossing sign or not.
Our features are: size, number of sides, number of colors used, and whether the sign has text or a symbol.
tidymodels
Predict whether an adolescent has consumed alcohol based on a set of risk behaviors.
library(tidymodels)

data("riskyBehaviors")

riskyBehaviors_analysis <-
  riskyBehaviors |>
  # recode: category 1 -> never used alcohol (0); categories 2-7 -> used alcohol (1)
  mutate(UsedAlcohol = case_when(
    AgeFirstAlcohol == 1 ~ 0,
    AgeFirstAlcohol %in% c(2, 3, 4, 5, 6, 7) ~ 1,
    TRUE ~ NA
  )) |>
  mutate(UsedAlcohol = factor(UsedAlcohol)) |>
  drop_na(UsedAlcohol) |>
  # drop variables that directly encode alcohol use
  select(-c(AgeFirstAlcohol, DaysAlcohol, BingeDrinking,
            LargestNumberOfDrinks, SourceAlcohol))
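A sketch of the resampling setup that produces the output below; the seed, the stratification, and the object names (riskyBehaviors_split, riskyBehaviors_train, riskyBehaviors_folds) are assumptions, but the default 75/25 split and the 5 folds match the printout.

set.seed(2023)

# 75/25 train/test split
riskyBehaviors_split <- initial_split(riskyBehaviors_analysis, strata = UsedAlcohol)
riskyBehaviors_train <- training(riskyBehaviors_split)
riskyBehaviors_test  <- testing(riskyBehaviors_split)
riskyBehaviors_split

# 5-fold cross-validation on the training set, stratified on the outcome
riskyBehaviors_folds <- vfold_cv(riskyBehaviors_train, v = 5, strata = UsedAlcohol)
riskyBehaviors_folds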
<Training/Testing/Total>
<9889/3297/13186>
# 5-fold cross-validation using stratification
# A tibble: 5 × 2
splits id
<list> <chr>
1 <split [7911/1978]> Fold1
2 <split [7911/1978]> Fold2
3 <split [7911/1978]> Fold3
4 <split [7911/1978]> Fold4
5 <split [7912/1977]> Fold5
ranger_spec <-
  rand_forest(
    # the number of predictors to sample at each split
    mtry = tune(),
    # the minimum number of observations needed to keep splitting nodes
    min_n = tune(),
    trees = 100) |>
  set_mode("classification") |>
  set_engine("ranger",
             # permutation importance is essential for vip()
             importance = "permutation")

ranger_spec
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 100
min_n = tune()
Engine-Specific Arguments:
importance = permutation
Computational engine: ranger
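The workflow printed below pairs this model specification with a recipe. A sketch of how they might be built; the step order matches the printout, but the selectors and object names are assumptions.

ranger_recipe <-
  recipe(UsedAlcohol ~ ., data = riskyBehaviors_train) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

ranger_workflow <-
  workflow() |>
  add_recipe(ranger_recipe) |>
  add_model(ranger_spec)

ranger_workflow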
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_impute_mode()
• step_impute_mean()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 100
min_n = tune()
Engine-Specific Arguments:
importance = permutation
Computational engine: ranger
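A sketch of the tuning run summarized below: tune_grid() fits each candidate mtry/min_n pair on the 5 folds. The grid size (11 candidates, giving the 22 metric rows) and the seed are assumptions.

set.seed(2023)
ranger_tune <-
  tune_grid(
    ranger_workflow,
    resamples = riskyBehaviors_folds,
    grid = 11
  )

collect_metrics(ranger_tune)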
# A tibble: 22 × 8
mtry min_n .metric .estimator mean n std_err .config
<int> <int> <chr> <chr> <dbl> <int> <dbl> <chr>
1 46 28 accuracy binary 0.793 5 0.00237 Preprocessor1_Model01
2 46 28 roc_auc binary 0.859 5 0.00206 Preprocessor1_Model01
3 22 7 accuracy binary 0.789 5 0.00381 Preprocessor1_Model02
4 22 7 roc_auc binary 0.857 5 0.00155 Preprocessor1_Model02
5 18 15 accuracy binary 0.794 5 0.00288 Preprocessor1_Model03
6 18 15 roc_auc binary 0.861 5 0.00148 Preprocessor1_Model03
7 32 10 accuracy binary 0.791 5 0.00266 Preprocessor1_Model04
8 32 10 roc_auc binary 0.856 5 0.00138 Preprocessor1_Model04
9 57 18 accuracy binary 0.794 5 0.00253 Preprocessor1_Model05
10 57 18 roc_auc binary 0.856 5 0.00189 Preprocessor1_Model05
# ℹ 12 more rows
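Picking a winner and plugging it back into the workflow might look like this; selecting by ROC AUC and the object names are assumptions.

best_ranger <- select_best(ranger_tune, metric = "roc_auc")
best_ranger

final_ranger_workflow <- finalize_workflow(ranger_workflow, best_ranger)
final_ranger_workflow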
# A tibble: 1 × 3
mtry min_n .config
<int> <int> <chr>
1 2 39 Preprocessor1_Model11
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_impute_mode()
• step_impute_mean()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)
Main Arguments:
mtry = 2
trees = 100
min_n = 39
Engine-Specific Arguments:
importance = permutation
Computational engine: ranger
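A sketch of fitting the finalized workflow on the training set and inspecting its predictions there (the 9,889 rows below); augment() attaches the predicted class and class probabilities to each row. The vip() call at the end is why importance = "permutation" was set in the engine.

ranger_fit <- fit(final_ranger_workflow, data = riskyBehaviors_train)

augment(ranger_fit, new_data = riskyBehaviors_train) |>
  select(UsedAlcohol, .pred_class, .pred_1, .pred_0)

# variable importance plot
library(vip)
ranger_fit |> extract_fit_parsnip() |> vip()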
# A tibble: 9,889 × 4
UsedAlcohol .pred_class .pred_1 .pred_0
<fct> <fct> <dbl> <dbl>
1 0 1 0.654 0.346
2 0 0 0.265 0.735
3 0 0 0.215 0.785
4 0 0 0.213 0.787
5 0 1 0.848 0.152
6 0 1 0.661 0.339
7 0 0 0.314 0.686
8 0 0 0.304 0.696
9 0 0 0.412 0.588
10 0 0 0.205 0.795
# ℹ 9,879 more rows
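The XGBoost specification printed below can be written the same way as the random forest one; the object name xgb_spec is an assumption.

xgb_spec <-
  boost_tree(
    mtry = tune(),
    trees = 100,
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune(),
    sample_size = tune()
  ) |>
  set_mode("classification") |>
  set_engine("xgboost")

xgb_spec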
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 100
min_n = tune()
tree_depth = tune()
learn_rate = tune()
loss_reduction = tune()
sample_size = tune()
Computational engine: xgboost
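The boosted tree reuses the same recipe inside its own workflow, which gives the printout below; the object name xgb_workflow is an assumption.

xgb_workflow <-
  workflow() |>
  add_recipe(ranger_recipe) |>
  add_model(xgb_spec)

xgb_workflow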
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_impute_mode()
• step_impute_mean()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = tune()
trees = 100
min_n = tune()
tree_depth = tune()
learn_rate = tune()
loss_reduction = tune()
sample_size = tune()
Computational engine: xgboost
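The XGBoost workflow is tuned exactly like the random forest; the object name xgb_tune, the grid size, and the seed are assumptions.

set.seed(2023)
xgb_tune <-
  tune_grid(
    xgb_workflow,
    resamples = riskyBehaviors_folds,
    grid = 11
  )

collect_metrics(xgb_tune)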
# A tibble: 22 × 12
mtry min_n tree_depth learn_rate loss_reduction sample_size .metric
<int> <int> <int> <dbl> <dbl> <dbl> <chr>
1 42 5 11 0.00124 0.000000157 0.349 accuracy
2 42 5 11 0.00124 0.000000157 0.349 roc_auc
3 39 15 9 0.0744 0.00237 0.819 accuracy
4 39 15 9 0.0744 0.00237 0.819 roc_auc
5 20 19 2 0.0374 0.00000612 0.705 accuracy
6 20 19 2 0.0374 0.00000612 0.705 roc_auc
7 31 8 5 0.0190 0.0000403 0.224 accuracy
8 31 8 5 0.0190 0.0000403 0.224 roc_auc
9 3 38 7 0.00414 0.00000000990 0.585 accuracy
10 3 38 7 0.00414 0.00000000990 0.585 roc_auc
# ℹ 12 more rows
# ℹ 5 more variables: .estimator <chr>, mean <dbl>, n <int>, std_err <dbl>,
#   .config <chr>
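The best XGBoost candidate is selected and finalized the same way (object names assumed):

best_xgb <- select_best(xgb_tune, metric = "roc_auc")
final_xgb_workflow <- finalize_workflow(xgb_workflow, best_xgb)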
# A tibble: 1 × 7
mtry min_n tree_depth learn_rate loss_reduction sample_size .config
<int> <int> <int> <dbl> <dbl> <dbl> <chr>
1 53 29 5 0.180 0.250 0.489 Preprocessor1_Mo…
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: boost_tree()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_impute_mode()
• step_impute_mean()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Boosted Tree Model Specification (classification)
Main Arguments:
mtry = 53
trees = 100
min_n = 29
tree_depth = 5
learn_rate = 0.180162087183753
loss_reduction = 0.249654965980625
sample_size = 0.489060359198431
Computational engine: xgboost
# A tibble: 9,889 × 4
UsedAlcohol .pred_class .pred_1 .pred_0
<fct> <fct> <dbl> <dbl>
1 0 1 0.692 0.308
2 0 0 0.176 0.824
3 0 0 0.118 0.882
4 0 0 0.117 0.883
5 0 1 0.937 0.0627
6 0 1 0.697 0.303
7 0 0 0.184 0.816
8 0 0 0.308 0.692
9 0 0 0.285 0.715
10 0 0 0.142 0.858
# ℹ 9,879 more rows
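For the final fit, the chosen workflow is trained once on the full training set and evaluated on the held-out test set with last_fit(). Which workflow wins is decided from the resampling metrics above; the metric set and object names here are assumptions that match the printouts below.

final_fit <-
  last_fit(
    final_ranger_workflow,        # or final_xgb_workflow, whichever is chosen
    split = riskyBehaviors_split,
    metrics = metric_set(kap, sens, spec, roc_auc)
  )

collect_metrics(final_fit)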
# A tibble: 4 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 kap binary 0.593 Preprocessor1_Model1
2 sens binary 0.793 Preprocessor1_Model1
3 spec binary 0.803 Preprocessor1_Model1
4 roc_auc binary 0.870 Preprocessor1_Model1
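The row-level test-set predictions shown next come from the same object:

collect_predictions(final_fit)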
# A tibble: 3,297 × 7
id .pred_class .row .pred_0 .pred_1 UsedAlcohol .config
<chr> <fct> <int> <dbl> <dbl> <fct> <chr>
1 train/test split 0 1 0.688 0.312 0 Preprocessor1…
2 train/test split 0 6 0.529 0.471 1 Preprocessor1…
3 train/test split 1 8 0.487 0.513 0 Preprocessor1…
4 train/test split 1 9 0.246 0.754 1 Preprocessor1…
5 train/test split 1 15 0.0676 0.932 1 Preprocessor1…
6 train/test split 1 18 0.173 0.827 1 Preprocessor1…
7 train/test split 0 25 0.652 0.348 1 Preprocessor1…
8 train/test split 0 27 0.642 0.358 0 Preprocessor1…
9 train/test split 1 31 0.0782 0.922 1 Preprocessor1…
10 train/test split 1 37 0.0247 0.975 1 Preprocessor1…
# ℹ 3,287 more rows
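The class-level summary below can be reproduced from those predictions with a yardstick metric set; the metric choice and event level are assumptions.

class_metrics <- metric_set(sens, spec, accuracy, kap)

collect_predictions(final_fit) |>
  class_metrics(truth = UsedAlcohol, estimate = .pred_class)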
# A tibble: 4 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 sens binary 0.803
2 spec binary 0.793
3 accuracy binary 0.799
4 kap binary 0.593