tidymodels
Catalina Canizares
🌲 Decision Trees
🌲 Basic concepts (Root, Feature, Leaf)
🌲 View and understand the splits
🌲 Entropy and Information Gain
🌲 Early stopping and Pruning
🌲 A tree in tidymodels
🌲 Decision Trees are widely used algorithms for supervised machine learning.
🌲 They provide interpretable models for making predictions in both regression and classification tasks.
🌲 A tree consists of a series of sequential decisions on the data set's features.
🌲 Instead of minimizing the Sum of Squared Errors, you can minimize entropy.
🌲 Entropy measures the amount of information (or uncertainty) in a variable or event.
🌲 We'll use it to identify regions consisting of a large number of similar (pure) or dissimilar (impure) elements.
🌲 A dataset of only pink dots would have very low (in fact, zero) entropy.
🌲 A dataset of mixed pink and green would have relatively high entropy.
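A minimal sketch of that idea in base R, using the formula \(H = -\sum_i p_i \log_2 p_i\) over the class proportions (the entropy() helper is ours, written only for illustration):

entropy <- function(p) {
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

entropy(c(1, 0))           # all pink: 0 (perfectly pure)
entropy(c(0.5, 0.5))       # half pink, half green: 1 (maximally impure)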
🌲 Information Gain measures the quality of a split.
🌲 Information gain is the splitting criterion used by the classic ID3 tree-building algorithm.
🌲 It is calculated for a split by subtracting the weighted entropies of each branch from the original entropy. When training a Decision Tree using these metrics, the best split is chosen by maximizing Information Gain.
🌲 Select the split that yields the largest reduction in entropy, or, equivalently, the largest increase in information.
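🌲 In symbols, when splitting a set \(S\) on feature \(A\) (with \(S_v\) the subset of \(S\) where \(A\) takes value \(v\), and \(H\) denoting entropy):

\[ IG(S, A) = H(S) - \sum_{v \in \text{values}(A)} \frac{|S_v|}{|S|}\, H(S_v) \]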
🌲 Let's look at it in a simulation.
If you are interested in the math: A Simple Explanation of Information Gain and Entropy
🌲 One of the questions that arises in a decision tree algorithm is: what is the optimal size of the final tree?
🌲 A tree that is too large risks over-fitting the training data and poorly generalizing to new samples.
🌲 A small tree might not capture important structural information about the sample space.
🌲 However, it is hard to tell when a tree algorithm should stop!
🌲 tree_depth: caps the maximum depth of the tree (see the parsnip sketch after the min_n bullets below).
🌲 A method to stop the tree early.
🌲 Used to prevent overfitting.
🌲 min_n: an integer for the minimum number of data points in a node that are required for the node to be split further.
🌲 Sets the minimum n to split at any node.
🌲 Another early-stopping method.
🌲 Used to prevent overfitting.
🌲 min_n = 1 would lead to the most overfit tree.
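A minimal sketch of both early-stopping arguments in a parsnip specification (the fixed values and the object name are ours, chosen only for illustration):

library(tidymodels)

shallow_tree_spec <-
  decision_tree(tree_depth = 4, min_n = 20) |>   # stop early: shallow tree, larger nodes
  set_engine("rpart") |>
  set_mode("classification")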
🌲 cost_complexity: adds a cost or penalty to the error rates of more complex trees.
🌲 Used to prevent overfitting.
🌲 Closer to zero ➡️ larger trees.
🌲 Higher penalty ➡️ smaller trees.
\[ R_\alpha(T) = R(T) + \alpha|\widetilde{T}| \]
🌲 \(R(T)\) = the misclassification rate of tree \(T\).
🌲 For any subtree \(T < T_{max}\) we will define its complexity as \(|\widetilde{T}|\).
🌲 \(|\widetilde{T}|\) = the number of terminal or leaf nodes in \(T\).
🌲 Let \(\alpha \ge 0\) be a real number called the complexity parameter.
🌲 If \(\alpha = 0\), then the biggest tree will be chosen because the complexity penalty term is essentially dropped.
🌲 As \(\alpha\) approaches infinity, the tree of size 1 (the root alone) will be selected.
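A sketch of that penalty in action with the rpart engine (the outcome ~ . formula and the df data frame are placeholders, not objects from these slides):

library(rpart)

# cp = 0 drops the penalty, so every split is kept and the biggest tree grows
big_tree <- rpart(outcome ~ ., data = df,
                  control = rpart.control(cp = 0, minsplit = 2))

printcp(big_tree)                          # cp table: penalty vs. tree size and error
small_tree <- prune(big_tree, cp = 0.05)   # a higher penalty yields a smaller tree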
tidymodels
Predict whether an adolescent has consumed alcohol or not based on a set of various risk behaviors.
data("riskyBehaviors")
riskyBehaviors_analysis <-
riskyBehaviors |>
mutate(UsedAlcohol = case_when(
AgeFirstAlcohol == 1 ~ 0,
AgeFirstAlcohol %in% c(2, 3, 5, 6, 4, 7) ~ 1,
TRUE ~ NA
)) |>
mutate(UsedAlcohol = factor(UsedAlcohol)) |>
drop_na(UsedAlcohol) |>
select(- c(AgeFirstAlcohol, DaysAlcohol, BingeDrinking, LargestNumberOfDrinks, SourceAlcohol, SourceAlcohol))
<Training/Testing/Total>
<9889/3297/13186>
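The split printed above can be produced with rsample along these lines (the 3/4 proportion, the stratification variable, and the seed are assumptions consistent with the counts shown):

set.seed(1990)

alcohol_split <- initial_split(riskyBehaviors_analysis,
                               prop = 3 / 4, strata = UsedAlcohol)
alcohol_train <- training(alcohol_split)
alcohol_test  <- testing(alcohol_split)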
# 10-fold cross-validation using stratification
# A tibble: 10 × 2
splits id
<list> <chr>
1 <split [8899/990]> Fold01
2 <split [8899/990]> Fold02
3 <split [8899/990]> Fold03
4 <split [8899/990]> Fold04
5 <split [8900/989]> Fold05
6 <split [8901/988]> Fold06
7 <split [8901/988]> Fold07
8 <split [8901/988]> Fold08
9 <split [8901/988]> Fold09
10 <split [8901/988]> Fold10
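A sketch of the stratified 10-fold resamples shown above (object names are ours):

alcohol_folds <- vfold_cv(alcohol_train, v = 10, strata = UsedAlcohol)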
Decision Tree Model Specification (classification)
Main Arguments:
cost_complexity = tune()
tree_depth = tune()
min_n = tune()
Computational engine: rpart
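A sketch of the tunable specification printed above (the object name tree_spec is an assumption):

tree_spec <-
  decision_tree(
    cost_complexity = tune(),
    tree_depth      = tune(),
    min_n           = tune()
  ) |>
  set_engine("rpart") |>
  set_mode("classification")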
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_impute_mode()
• step_impute_mean()
── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)
Main Arguments:
cost_complexity = tune()
tree_depth = tune()
min_n = tune()
Computational engine: rpart
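The workflow printed above bundles a two-step imputation recipe with that specification; a sketch, where the variable selectors are assumptions:

tree_recipe <-
  recipe(UsedAlcohol ~ ., data = alcohol_train) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors())

tree_workflow <-
  workflow() |>
  add_recipe(tree_recipe) |>
  add_model(tree_spec)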
# A tibble: 64 × 3
cost_complexity tree_depth min_n
<dbl> <int> <int>
1 0.0000000001 2 2
2 0.0000001 2 2
3 0.0001 2 2
4 0.1 2 2
5 0.0000000001 3 2
6 0.0000001 3 2
7 0.0001 3 2
8 0.1 3 2
9 0.0000000001 4 2
10 0.0000001 4 2
# ℹ 54 more rows
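The 64 rows above are consistent with a regular grid of 4 levels per parameter (4 × 4 × 4 = 64); a sketch, where the tree_depth range is an assumption matching the displayed values:

tree_grid <- grid_regular(
  cost_complexity(),              # default range spans 1e-10 to 1e-1
  tree_depth(range = c(2, 5)),    # assumed range, consistent with the rows above
  min_n(),                        # default range spans 2 to 40
  levels = 4
)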
# A tibble: 5 × 9
cost_complexity tree_depth min_n .metric .estimator mean n std_err
<dbl> <int> <int> <chr> <chr> <dbl> <int> <dbl>
1 0.0000000001 5 2 roc_auc binary 0.834 10 0.00520
2 0.0000001 5 2 roc_auc binary 0.834 10 0.00520
3 0.0000000001 5 40 roc_auc binary 0.834 10 0.00484
4 0.0000001 5 40 roc_auc binary 0.834 10 0.00484
5 0.0001 5 40 roc_auc binary 0.834 10 0.00484
# ℹ 1 more variable: .config <chr>
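A sketch of the tuning step that yields resampling results like the table above (the seed and object names are assumptions):

set.seed(1990)
tree_tuned <-
  tune_grid(
    tree_workflow,
    resamples = alcohol_folds,
    grid      = tree_grid
  )

show_best(tree_tuned, metric = "roc_auc", n = 5)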
# A tibble: 1 × 4
cost_complexity tree_depth min_n .config
<dbl> <int> <int> <chr>
1 0.0000000001 5 2 Preprocessor1_Model13
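Selecting the winning combination above and plugging it back into the workflow gives the finalized (but still untrained) workflow printed next; a sketch:

best_tree      <- select_best(tree_tuned, metric = "roc_auc")
final_workflow <- finalize_workflow(tree_workflow, best_tree)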
── Workflow ────────────────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_impute_mode()
• step_impute_mean()
── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (classification)
Main Arguments:
cost_complexity = 1e-10
tree_depth = 5
min_n = 2
Computational engine: rpart
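Fitting the finalized workflow on the training data then produces the trained workflow shown below; a sketch:

final_fit <- fit(final_workflow, data = alcohol_train)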
── Workflow [trained] ──────────────────────────────────────────────────────────
Preprocessor: Recipe
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps
• step_impute_mode()
• step_impute_mean()
── Model ───────────────────────────────────────────────────────────────────────
n= 9889
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 9889 4354 1 (0.44028719 0.55971281)
2) Vaping=0 5109 1608 0 (0.68526130 0.31473870)
4) AgeFirstMarihuana< 1.5 4397 1086 0 (0.75301342 0.24698658)
8) AgeFirstCig< 2.5 4282 1003 0 (0.76576366 0.23423634)
16) TextingDriving< 2.659047 3911 844 0 (0.78419841 0.21580159) *
17) TextingDriving>=2.659047 371 159 0 (0.57142857 0.42857143)
34) DrivingDrinking< 2.5 363 151 0 (0.58402204 0.41597796) *
35) DrivingDrinking>=2.5 8 0 1 (0.00000000 1.00000000) *
9) AgeFirstCig>=2.5 115 32 1 (0.27826087 0.72173913)
18) Grade=12,9 53 21 1 (0.39622642 0.60377358)
36) Race=Hispanic/Latino,Native Hawaiian/Other PI 5 1 0 (0.80000000 0.20000000) *
37) Race=Asian,Black or African American,Multiple-Hispanic,Multiple-Non-Hispanic,White 48 17 1 (0.35416667 0.64583333) *
19) Grade=10,11 62 11 1 (0.17741935 0.82258065) *
5) AgeFirstMarihuana>=1.5 712 190 1 (0.26685393 0.73314607)
10) AgeFirstCig< 1.423554 448 162 1 (0.36160714 0.63839286)
20) AgeFirstMarihuana< 2.809703 59 24 0 (0.59322034 0.40677966)
40) TextingDriving< 2.659047 52 17 0 (0.67307692 0.32692308) *
41) TextingDriving>=2.659047 7 0 1 (0.00000000 1.00000000) *
21) AgeFirstMarihuana>=2.809703 389 127 1 (0.32647815 0.67352185) *
11) AgeFirstCig>=1.423554 264 28 1 (0.10606061 0.89393939) *
3) Vaping=1 4780 853 1 (0.17845188 0.82154812)
6) AgeFirstMarihuana< 1.5 1658 556 1 (0.33534379 0.66465621)
12) SourceVaping=1 937 396 1 (0.42262540 0.57737460)
24) SexualAbuseByPartner< 1.884927 450 220 0 (0.51111111 0.48888889)
48) AgeFirstCig< 5.5 424 198 0 (0.53301887 0.46698113) *
49) AgeFirstCig>=5.5 26 4 1 (0.15384615 0.84615385) *
25) SexualAbuseByPartner>=1.884927 487 166 1 (0.34086242 0.65913758) *
13) SourceVaping=2,3,4,5,6,7,8 721 160 1 (0.22191401 0.77808599) *
7) AgeFirstMarihuana>=1.5 3122 297 1 (0.09513133 0.90486867) *
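The training-set predictions below can be obtained with augment() on the fitted workflow; a sketch (the column selection simply mirrors the table):

augment(final_fit, new_data = alcohol_train) |>
  select(UsedAlcohol, .pred_class, .pred_1, .pred_0)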
# A tibble: 9,889 × 4
UsedAlcohol .pred_class .pred_1 .pred_0
<fct> <fct> <dbl> <dbl>
1 0 1 0.905 0.0951
2 0 0 0.216 0.784
3 0 0 0.216 0.784
4 0 0 0.216 0.784
5 0 1 0.905 0.0951
6 0 0 0.467 0.533
7 0 0 0.216 0.784
8 0 0 0.216 0.784
9 0 0 0.416 0.584
10 0 0 0.216 0.784
# ℹ 9,879 more rows
This model has not been evaluated on the testing set yet, because we plan to conduct an additional analysis. In the next presentation, we will use the same training data with the Random Forest algorithm and then evaluate its performance on the testing set.