Regularization Techniques: Ridge, Lasso, and Elastic Net Using `tidymodels`

Catalina Canizares

Agenda

Review from last class
Understand shrinkage
Review the Residual Sum of Squares concept
Overview of Ridge Regression
Overview of Lasso
Elastic Net
An example using tidymodels

Objective

By the end of the class, students will have:

A solid understanding of L1 and L2 penalties.
The ability to perform complete Lasso and Ridge analyses using the Tidy Models syntax.

Review of what we have covered

Click here

Shrinkage

Shrinkage is a technique that “shrinks” or reduces the coefficients of predictor variables towards zero.
Shrinkage methods strike a balance between capturing the relationships in the data and avoiding overfitting.
Shrinking the coefficient estimates can significantly reduce variance.
The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.

Residual Sum of Squares (RSS)

The RSS measures the amount of error remaining between the regression function and the data set after the model has been run.
A smaller RSS represents a regression function that is well-fit to the data.

\[ \text{RSS} = \sum_{i=1}^n\bigg( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\bigg)^2 \]

Ridge Regression

Ridge

Ridge regression is a regularization technique that adds a penalty term (\(\lambda\)) to the RSS.
It shrinks the regression coefficients towards zero, reducing their impact on the model.
It is also known as the L2 regularization
The formula for Ridge regression is:

\[ \sum_{i=1}^n\bigg( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\bigg)^2 + \lambda \sum_{j=1}^p \beta_j^2 \]

Ridge

When \(\lambda\) is extremely large, then all of the ridge coefficient estimates are basically zero; this corresponds to the null model that contains no predictors.

Lasso

Lasso (Least Absolute Shrinkage and Selection Operator) is another regularization technique that adds a penalty term (\(\lambda\)) to the RSS
It has a built-in feature selection property, as it can shrink coefficients to exactly zero.
It is also known as the L1 regularization

\[ \sum_{i=1}^n\bigg( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\bigg)^2 + \lambda \sum_{j=1}^p | \beta_j| \]

Lasso

When λ = 0, then the lasso simply gives the least squares fit, and when \(\lambda\) becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero.

Compare Ridge vs. Lasso

Depending on the value of \(\lambda\), the lasso can produce a model involving any number of variables.

Ridge will always include all of the variables in the model, although the magnitude of the coefficient estimates will depend on \(\lambda\)

The Variable Selection Property of the Lasso

-The fact that some lasso coefficients are shrunken entirely to zero explains why the lasso performs feature selection

Elastic Net

Is a generalization of lasso and ridge regression
It combines the two penalties.
The formula for Elastic Net:

\[ \sum_{i=1}^n\bigg( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\bigg)^2 + \lambda \sum_{j=1}^p \beta_j^2 + \lambda \sum_{j=1}^p | \beta_j| \]

Elastic Net

The advantage of the elastic net is that it keeps the feature selection quality from the lasso penalty as well as the effectiveness of the ridge penalty.

How do we choose the best \(\lambda\)?

Using resampling methods

Check the perfomance using RMSE or ROC- AUC

We can check from no shrinkage to so much that we don’t have predictors left (lasso)

Lasso with `tidymodels`

Task

Determine the key predictors from a comprehensive set of 42 risky behaviors, encompassing alcohol and drug use, as well as reckless activities, with the primary objective of accurately predicting the probability of engaging in texting and driving.

The Data - `riskyBehaviors`

library(MLearnYRBSS)
data("riskyBehaviors")

Data Cleaning

riskyBehaviors_analysis <- 
riskyBehaviors |> 
  mutate(
    TextingDriving = case_when(
      TextingDriving == 1 ~ 0, 
      TextingDriving %in% c(2, 3, 4, 5) ~ 1, 
      TRUE ~ NA
  )) |> 
  mutate(TextingDriving = factor(TextingDriving)) |> 
  drop_na(TextingDriving) |> 
  select(-SourceVaping, -SourceAlcohol, -SexOrientation, -DrivingDrinking,
         -SexualAbuse, -SexualAbuseByPartner, -Grade, -Age)

Split Training - Testing

set.seed(1990)

analysis_split <- initial_split(riskyBehaviors_analysis,
                                stratum = TextingDriving)

analysis_train <- training(analysis_split)
analysis_test <- testing(analysis_split)

analysis_split

<Training/Testing/Total>
<8170/2724/10894>

Lets Check Our Work

analysis_train |> 
  tabyl(TextingDriving)  |> 
  adorn_pct_formatting(0) |> 
  adorn_totals()

 TextingDriving    n percent
              0 3666     45%
              1 4504     55%
          Total 8170       -

analysis_test |>  
  tabyl(TextingDriving)  |> 
  adorn_pct_formatting(0) |> 
  adorn_totals()

 TextingDriving    n percent
              0 1207     44%
              1 1517     56%
          Total 2724       -

Crossvalidation

set.seed(1990)

analysis_folds <- vfold_cv(analysis_train, v = 5) 
analysis_folds

#  5-fold cross-validation 
# A tibble: 5 × 2
  splits              id   
  <list>              <chr>
1 <split [6536/1634]> Fold1
2 <split [6536/1634]> Fold2
3 <split [6536/1634]> Fold3
4 <split [6536/1634]> Fold4
5 <split [6536/1634]> Fold5

Recipe

texting_rec <- 
  recipe(TextingDriving ~ ., data = analysis_train) |> 
  step_zv(all_predictors()) |> 
  step_impute_mode(all_nominal_predictors()) |> 
  step_impute_mean(all_numeric_predictors()) |> 
  step_normalize(all_numeric_predictors()) |> 
  step_dummy(all_nominal_predictors()) 

texting_rec

Specification

texting_spec <-
  logistic_reg(penalty = tune(), 
               mixture = 1) |> 
# mixture = 1 specifies a pure lasso model,
# mixture = 0 specifies a ridge regression model, and
# 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
# https://parsnip.tidymodels.org/reference/glmnet-details.html
  set_engine('glmnet')

texting_spec

Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = tune()
  mixture = 1

Computational engine: glmnet

Workflow

texting_wf <- 
  workflow() |> 
  add_recipe(texting_rec) |> 
  add_model(texting_spec)

texting_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_zv()
• step_impute_mode()
• step_impute_mean()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = tune()
  mixture = 1

Computational engine: glmnet

The Grid

A Grid search systematically explores different penalty values (e.g., 0.001, 0.01, 0.1, 1, 10).
Model performance is evaluated for each combination of penalty values.

The Grid

# Used for Dr. Balise Class
glmnet_grid <- data.frame(penalty = 10^seq(-6, -1, length.out = 20))
glmnet_grid

        penalty
1  1.000000e-06
2  1.832981e-06
3  3.359818e-06
4  6.158482e-06
5  1.128838e-05
6  2.069138e-05
7  3.792690e-05
8  6.951928e-05
9  1.274275e-04
10 2.335721e-04
11 4.281332e-04
12 7.847600e-04
13 1.438450e-03
14 2.636651e-03
15 4.832930e-03
16 8.858668e-03
17 1.623777e-02
18 2.976351e-02
19 5.455595e-02
20 1.000000e-01

# I made it up
penalty_grid <- grid_regular(penalty(range = c(-4, 4)), levels = 50)
penalty_grid

# A tibble: 50 × 1
    penalty
      <dbl>
 1 0.0001  
 2 0.000146
 3 0.000212
 4 0.000309
 5 0.000450
 6 0.000655
 7 0.000954
 8 0.00139 
 9 0.00202 
10 0.00295 
# ℹ 40 more rows

# What I would recommend 
lambda_grid <- grid_regular(penalty(), levels = 50)
lambda_grid

# A tibble: 50 × 1
    penalty
      <dbl>
 1 1   e-10
 2 1.60e-10
 3 2.56e-10
 4 4.09e-10
 5 6.55e-10
 6 1.05e- 9
 7 1.68e- 9
 8 2.68e- 9
 9 4.29e- 9
10 6.87e- 9
# ℹ 40 more rows

How to determine the grid?

Tune the Grid Using the Workflow

doParallel::registerDoParallel()

set.seed(2023)

lasso_tune <- 
  tune_grid(
  object = texting_wf, 
  resamples = analysis_folds,
  grid = lambda_grid
)

doParallel::stopImplicitCluster()

Collect the Metrics

lasso_tune |> 
  collect_metrics()

# A tibble: 100 × 7
    penalty .metric  .estimator  mean     n std_err .config              
      <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
 1 1   e-10 accuracy binary     0.579     5 0.00473 Preprocessor1_Model01
 2 1   e-10 roc_auc  binary     0.612     5 0.00547 Preprocessor1_Model01
 3 1.60e-10 accuracy binary     0.579     5 0.00473 Preprocessor1_Model02
 4 1.60e-10 roc_auc  binary     0.612     5 0.00547 Preprocessor1_Model02
 5 2.56e-10 accuracy binary     0.579     5 0.00473 Preprocessor1_Model03
 6 2.56e-10 roc_auc  binary     0.612     5 0.00547 Preprocessor1_Model03
 7 4.09e-10 accuracy binary     0.579     5 0.00473 Preprocessor1_Model04
 8 4.09e-10 roc_auc  binary     0.612     5 0.00547 Preprocessor1_Model04
 9 6.55e-10 accuracy binary     0.579     5 0.00473 Preprocessor1_Model05
10 6.55e-10 roc_auc  binary     0.612     5 0.00547 Preprocessor1_Model05
# ℹ 90 more rows

autoplot(lasso_tune)

lasso_tune |> 
  collect_metrics() |> 
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_line(linewidth = 1.5) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  theme(legend.position = "none")

Choosing our final parameter

best <- lasso_tune |> 
  select_best("roc_auc")

best

# A tibble: 1 × 2
   penalty .config              
     <dbl> <chr>                
1 0.000543 Preprocessor1_Model34

Finalize the Workflow

final_wf <- finalize_workflow(texting_wf, best)

final_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_zv()
• step_impute_mode()
• step_impute_mean()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = 0.000542867543932386
  mixture = 1

Computational engine: glmnet

Lets Fit!!

texting_fit <- 
  fit(final_wf, data = analysis_train)

texting_fit

══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_zv()
• step_impute_mode()
• step_impute_mean()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────

Call:  glmnet::glmnet(x = maybe_matrix(x), y = y, family = "binomial",      alpha = ~1) 

   Df %Dev   Lambda
1   0 0.00 0.065350
2   1 0.21 0.059540
3   1 0.39 0.054250
4   1 0.54 0.049430
5   1 0.66 0.045040
6   1 0.76 0.041040
7   1 0.85 0.037390
8   2 0.93 0.034070
9   4 1.07 0.031040
10  5 1.21 0.028290
11  7 1.36 0.025770
12  9 1.53 0.023480
13 10 1.69 0.021400
14 11 1.83 0.019500
15 11 1.95 0.017760
16 12 2.06 0.016190
17 14 2.15 0.014750
18 14 2.24 0.013440
19 15 2.33 0.012240
20 17 2.41 0.011160
21 20 2.50 0.010170
22 21 2.59 0.009263
23 24 2.67 0.008440
24 24 2.75 0.007690
25 26 2.82 0.007007
26 30 2.91 0.006384
27 31 3.00 0.005817
28 32 3.08 0.005300
29 32 3.14 0.004830
30 36 3.20 0.004401
31 38 3.25 0.004010
32 39 3.30 0.003653
33 39 3.35 0.003329
34 40 3.40 0.003033
35 43 3.43 0.002764
36 43 3.47 0.002518
37 43 3.50 0.002294
38 43 3.53 0.002091
39 43 3.55 0.001905
40 44 3.57 0.001736
41 44 3.58 0.001581
42 44 3.60 0.001441
43 45 3.61 0.001313
44 45 3.62 0.001196
45 45 3.62 0.001090
46 45 3.63 0.000993

...
and 8 more lines.

Review fit on the training data

texting_pred <- 
  augment(texting_fit, analysis_train) |> 
  select(TextingDriving, .pred_class, .pred_1, .pred_0)

texting_pred

# A tibble: 8,170 × 4
   TextingDriving .pred_class .pred_1 .pred_0
   <fct>          <fct>         <dbl>   <dbl>
 1 0              1             0.510   0.490
 2 1              0             0.437   0.563
 3 0              0             0.267   0.733
 4 1              0             0.399   0.601
 5 1              1             0.505   0.495
 6 1              1             0.615   0.385
 7 1              1             0.548   0.452
 8 1              0             0.460   0.540
 9 1              1             0.519   0.481
10 1              1             0.602   0.398
# ℹ 8,160 more rows

ROC plot and AUC value

roc_plot_training <- 
  texting_pred |> 
  roc_curve(truth = TextingDriving, .pred_1, event_level = "second") |> 
  autoplot()

roc_plot_training

texting_pred |> 
  roc_auc(truth = TextingDriving,.pred_1, event_level = "second")

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.622

Ckeck the estimates

texting_fit |> 
  extract_fit_parsnip() |> 
  tidy()

# A tibble: 51 × 3
   term                 estimate  penalty
   <chr>                   <dbl>    <dbl>
 1 (Intercept)           0.469   0.000543
 2 DrinkingDriver       -0.0224  0.000543
 3 WeaponCarrying        0.0593  0.000543
 4 WeaponCarryingSchool  0.0666  0.000543
 5 GunCarrying           0.0995  0.000543
 6 UnsafeAtSchool        0.0439  0.000543
 7 InjuredInSchool       0.00376 0.000543
 8 PhysicalFight        -0.0571  0.000543
 9 SchoolPhysicalFight  -0.0243  0.000543
10 TimesSexualAbuse      0.0484  0.000543
# ℹ 41 more rows

Lets fit on the TEST Data

texting_last_fit <- 
  last_fit(final_wf,
           split = analysis_split,
           metrics = metric_set(accuracy, kap, roc_auc))

texting_last_fit

# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits              id               .metrics .notes   .predictions .workflow 
  <list>              <chr>            <list>   <list>   <list>       <list>    
1 <split [8170/2724]> train/test split <tibble> <tibble> <tibble>     <workflow>

Collect the Metrics

collect_metrics(texting_last_fit)

# A tibble: 3 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.569 Preprocessor1_Model1
2 kap      binary         0.105 Preprocessor1_Model1
3 roc_auc  binary         0.599 Preprocessor1_Model1

Review Prediction on the Test Data with `yardstick`

the_prediction <-  
  texting_last_fit |> 
  collect_predictions() 
  

the_prediction

# A tibble: 2,724 × 7
   id               .pred_class  .row .pred_0 .pred_1 TextingDriving .config    
   <chr>            <fct>       <int>   <dbl>   <dbl> <fct>          <chr>      
 1 train/test split 0               3   0.589   0.411 1              Preprocess…
 2 train/test split 0               6   0.527   0.473 0              Preprocess…
 3 train/test split 1               8   0.492   0.508 0              Preprocess…
 4 train/test split 1              13   0.417   0.583 1              Preprocess…
 5 train/test split 1              20   0.460   0.540 0              Preprocess…
 6 train/test split 0              24   0.545   0.455 1              Preprocess…
 7 train/test split 1              28   0.305   0.695 1              Preprocess…
 8 train/test split 1              29   0.411   0.589 0              Preprocess…
 9 train/test split 1              30   0.221   0.779 1              Preprocess…
10 train/test split 1              32   0.456   0.544 0              Preprocess…
# ℹ 2,714 more rows

Review Prediction on the Test Data with `yardstick`

multi_metric <- metric_set(sens, spec, accuracy, kap)

multi_metric(the_prediction, 
             truth = TextingDriving, 
             estimate = .pred_class, 
             event_level = "second")

# A tibble: 4 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 sens     binary         0.713
2 spec     binary         0.389
3 accuracy binary         0.569
4 kap      binary         0.105

Confusion Matrix on the Test Data

ROC curve on the Test Data

Variable Importance Plot

library(vip)

texting_last_fit |> 
  extract_fit_engine() |> 
  vi() |> 
  group_by(Sign) |> 
  slice_max(Importance, n = 5) |> 
  ungroup() |> 
  ggplot(aes(Importance, fct_reorder(Variable, Importance), fill = Sign)) + 
  geom_col() +
  facet_wrap(vars(Sign), scales = "free_y") +
  labs(y = NULL) +
  theme(legend.position = "none")

We did it!

Review Questions

Is the Residual Sum of Squares a relevant concept for L1 and L2 penalties? why?

L2 regularization would be better of if we called it the squared regularization?

If I want to reduce the dimensionality of a dataset I would use the _______ penalty?

If I want tidymodels to perform a ridge regression what should I change in this code?

review_spec <-
  logistic_reg(penalty = tune(), 
               mixture = tune()) |> 
  set_engine('glmnet')

BONUS: stepwise vs. lasso

Regularization Techniques: Ridge, Lasso, and Elastic Net Using tidymodels

Agenda

Objective

Review of what we have covered

Shrinkage

Residual Sum of Squares (RSS)

Ridge Regression

Ridge

Ridge

Lasso

Lasso

Lasso

Compare Ridge vs. Lasso

The Variable Selection Property of the Lasso

Elastic Net

Elastic Net

How do we choose the best \(\lambda\)?

Lasso with tidymodels

Task

The Data - riskyBehaviors

Data Cleaning

Split Training - Testing

Lets Check Our Work

Crossvalidation

Recipe

Specification

Workflow

The Grid

The Grid

How to determine the grid?

How to determine the grid?

Tune the Grid Using the Workflow

Collect the Metrics

Visualize the Metrics

Choosing our final parameter

Finalize the Workflow

Lets Fit!!

Review fit on the training data

ROC plot and AUC value

Ckeck the estimates

Lets fit on the TEST Data

Collect the Metrics

Review Prediction on the Test Data with yardstick

Review Prediction on the Test Data with yardstick

Confusion Matrix on the Test Data

ROC curve on the Test Data

Variable Importance Plot

We did it!

Review Questions

Thank you!

Regularization Techniques: Ridge, Lasso, and Elastic Net Using `tidymodels`

Lasso with `tidymodels`

The Data - `riskyBehaviors`

Review Prediction on the Test Data with `yardstick`

Review Prediction on the Test Data with `yardstick`