Regularization Techniques: Ridge, Lasso, and Elastic Net Using tidymodels

Catalina Canizares

Agenda

  1. Review from last class
  2. Understand shrinkage
  3. Review the Residual Sum of Squares concept
  4. Overview of Ridge Regression
  5. Overview of Lasso
  6. Elastic Net
  7. An example using tidymodels

Objective

By the end of the class, students will have:

  1. A solid understanding of L1 and L2 penalties.

  2. The ability to perform complete Lasso and Ridge analyses using the Tidy Models syntax.

Review of what we have covered

Shrinkage

  • Shrinkage is a technique that “shrinks” or reduces the coefficients of predictor variables towards zero.
  • Shrinkage methods strike a balance between capturing the relationships in the data and avoiding overfitting.
  • Shrinking the coefficient estimates can significantly reduce variance.
  • The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso.

Residual Sum of Squares (RSS)

The RSS measures the amount of error remaining between the regression function and the data set after the model has been run.
A smaller RSS represents a regression function that is well-fit to the data.

\[ \text{RSS} = \sum_{i=1}^n\bigg( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\bigg)^2 \]

Ridge Regression

Ridge

  • Ridge regression is a regularization technique that adds a penalty term (\(\lambda\)) to the RSS.

  • It shrinks the regression coefficients towards zero, reducing their impact on the model.

  • It is also known as the L2 regularization

  • The formula for Ridge regression is:

\[ \sum_{i=1}^n\bigg( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\bigg)^2 + \lambda \sum_{j=1}^p \beta_j^2 \]

Ridge

When \(\lambda\) is extremely large, then all of the ridge coefficient estimates are basically zero; this corresponds to the null model that contains no predictors.

Lasso

Lasso

  • Lasso (Least Absolute Shrinkage and Selection Operator) is another regularization technique that adds a penalty term (\(\lambda\)) to the RSS

  • It has a built-in feature selection property, as it can shrink coefficients to exactly zero.

  • It is also known as the L1 regularization

\[ \sum_{i=1}^n\bigg( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\bigg)^2 + \lambda \sum_{j=1}^p | \beta_j| \]

Lasso

When λ = 0, then the lasso simply gives the least squares fit, and when \(\lambda\) becomes sufficiently large, the lasso gives the null model in which all coefficient estimates equal zero.

Compare Ridge vs. Lasso

Depending on the value of \(\lambda\), the lasso can produce a model involving any number of variables.

Ridge will always include all of the variables in the model, although the magnitude of the coefficient estimates will depend on \(\lambda\)

The Variable Selection Property of the Lasso

-The fact that some lasso coefficients are shrunken entirely to zero explains why the lasso performs feature selection

Elastic Net

  • Is a generalization of lasso and ridge regression
  • It combines the two penalties.
  • The formula for Elastic Net:

\[ \sum_{i=1}^n\bigg( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij}\bigg)^2 + \lambda \sum_{j=1}^p \beta_j^2 + \lambda \sum_{j=1}^p | \beta_j| \]

Elastic Net

The advantage of the elastic net is that it keeps the feature selection quality from the lasso penalty as well as the effectiveness of the ridge penalty.

How do we choose the best \(\lambda\)?

  1. Using resampling methods
  • Check the perfomance using RMSE or ROC- AUC
  1. We can check from no shrinkage to so much that we don’t have predictors left (lasso)

Lasso with tidymodels

Task

Determine the key predictors from a comprehensive set of 42 risky behaviors, encompassing alcohol and drug use, as well as reckless activities, with the primary objective of accurately predicting the probability of engaging in texting and driving.

The Data - riskyBehaviors

library(MLearnYRBSS)
data("riskyBehaviors")

Data Cleaning

riskyBehaviors_analysis <- 
riskyBehaviors |> 
  mutate(
    TextingDriving = case_when(
      TextingDriving == 1 ~ 0, 
      TextingDriving %in% c(2, 3, 4, 5) ~ 1, 
      TRUE ~ NA
  )) |> 
  mutate(TextingDriving = factor(TextingDriving)) |> 
  drop_na(TextingDriving) |> 
  select(-SourceVaping, -SourceAlcohol, -SexOrientation, -DrivingDrinking,
         -SexualAbuse, -SexualAbuseByPartner, -Grade, -Age) 

Split Training - Testing

set.seed(1990)

analysis_split <- initial_split(riskyBehaviors_analysis,
                                stratum = TextingDriving)

analysis_train <- training(analysis_split)
analysis_test <- testing(analysis_split)

analysis_split
<Training/Testing/Total>
<8170/2724/10894>

Lets Check Our Work

analysis_train |> 
  tabyl(TextingDriving)  |> 
  adorn_pct_formatting(0) |> 
  adorn_totals()
 TextingDriving    n percent
              0 3666     45%
              1 4504     55%
          Total 8170       -
analysis_test |>  
  tabyl(TextingDriving)  |> 
  adorn_pct_formatting(0) |> 
  adorn_totals()
 TextingDriving    n percent
              0 1207     44%
              1 1517     56%
          Total 2724       -

Crossvalidation

set.seed(1990)

analysis_folds <- vfold_cv(analysis_train, v = 5) 
analysis_folds
#  5-fold cross-validation 
# A tibble: 5 × 2
  splits              id   
  <list>              <chr>
1 <split [6536/1634]> Fold1
2 <split [6536/1634]> Fold2
3 <split [6536/1634]> Fold3
4 <split [6536/1634]> Fold4
5 <split [6536/1634]> Fold5

Recipe

texting_rec <- 
  recipe(TextingDriving ~ ., data = analysis_train) |> 
  step_zv(all_predictors()) |> 
  step_impute_mode(all_nominal_predictors()) |> 
  step_impute_mean(all_numeric_predictors()) |> 
  step_normalize(all_numeric_predictors()) |> 
  step_dummy(all_nominal_predictors()) 

texting_rec 

Specification

texting_spec <-
  logistic_reg(penalty = tune(), 
               mixture = 1) |> 
# mixture = 1 specifies a pure lasso model,
# mixture = 0 specifies a ridge regression model, and
# 0 < mixture < 1 specifies an elastic net model, interpolating lasso and ridge.
# https://parsnip.tidymodels.org/reference/glmnet-details.html
  set_engine('glmnet')

texting_spec
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = tune()
  mixture = 1

Computational engine: glmnet 

Workflow

texting_wf <- 
  workflow() |> 
  add_recipe(texting_rec) |> 
  add_model(texting_spec)

texting_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_zv()
• step_impute_mode()
• step_impute_mean()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = tune()
  mixture = 1

Computational engine: glmnet 

The Grid

  • A Grid search systematically explores different penalty values (e.g., 0.001, 0.01, 0.1, 1, 10).
  • Model performance is evaluated for each combination of penalty values.

The Grid

# Used for Dr. Balise Class
glmnet_grid <- data.frame(penalty = 10^seq(-6, -1, length.out = 20))
glmnet_grid
        penalty
1  1.000000e-06
2  1.832981e-06
3  3.359818e-06
4  6.158482e-06
5  1.128838e-05
6  2.069138e-05
7  3.792690e-05
8  6.951928e-05
9  1.274275e-04
10 2.335721e-04
11 4.281332e-04
12 7.847600e-04
13 1.438450e-03
14 2.636651e-03
15 4.832930e-03
16 8.858668e-03
17 1.623777e-02
18 2.976351e-02
19 5.455595e-02
20 1.000000e-01
# I made it up
penalty_grid <- grid_regular(penalty(range = c(-4, 4)), levels = 50)
penalty_grid
# A tibble: 50 × 1
    penalty
      <dbl>
 1 0.0001  
 2 0.000146
 3 0.000212
 4 0.000309
 5 0.000450
 6 0.000655
 7 0.000954
 8 0.00139 
 9 0.00202 
10 0.00295 
# ℹ 40 more rows
# What I would recommend 
lambda_grid <- grid_regular(penalty(), levels = 50)
lambda_grid
# A tibble: 50 × 1
    penalty
      <dbl>
 1 1   e-10
 2 1.60e-10
 3 2.56e-10
 4 4.09e-10
 5 6.55e-10
 6 1.05e- 9
 7 1.68e- 9
 8 2.68e- 9
 9 4.29e- 9
10 6.87e- 9
# ℹ 40 more rows

How to determine the grid?

How to determine the grid?

Tune the Grid Using the Workflow



doParallel::registerDoParallel()

set.seed(2023)

lasso_tune <- 
  tune_grid(
  object = texting_wf, 
  resamples = analysis_folds,
  grid = lambda_grid
)

doParallel::stopImplicitCluster()

Collect the Metrics



lasso_tune |> 
  collect_metrics()
# A tibble: 100 × 7
    penalty .metric  .estimator  mean     n std_err .config              
      <dbl> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
 1 1   e-10 accuracy binary     0.579     5 0.00473 Preprocessor1_Model01
 2 1   e-10 roc_auc  binary     0.612     5 0.00547 Preprocessor1_Model01
 3 1.60e-10 accuracy binary     0.579     5 0.00473 Preprocessor1_Model02
 4 1.60e-10 roc_auc  binary     0.612     5 0.00547 Preprocessor1_Model02
 5 2.56e-10 accuracy binary     0.579     5 0.00473 Preprocessor1_Model03
 6 2.56e-10 roc_auc  binary     0.612     5 0.00547 Preprocessor1_Model03
 7 4.09e-10 accuracy binary     0.579     5 0.00473 Preprocessor1_Model04
 8 4.09e-10 roc_auc  binary     0.612     5 0.00547 Preprocessor1_Model04
 9 6.55e-10 accuracy binary     0.579     5 0.00473 Preprocessor1_Model05
10 6.55e-10 roc_auc  binary     0.612     5 0.00547 Preprocessor1_Model05
# ℹ 90 more rows

Visualize the Metrics

autoplot(lasso_tune)

lasso_tune |> 
  collect_metrics() |> 
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_line(linewidth = 1.5) +
  facet_wrap(~.metric, scales = "free", nrow = 2) +
  theme(legend.position = "none")

Choosing our final parameter



best <- lasso_tune |> 
  select_best("roc_auc")

best
# A tibble: 1 × 2
   penalty .config              
     <dbl> <chr>                
1 0.000543 Preprocessor1_Model34

Finalize the Workflow

final_wf <- finalize_workflow(texting_wf, best)

final_wf
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_zv()
• step_impute_mode()
• step_impute_mean()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)

Main Arguments:
  penalty = 0.000542867543932386
  mixture = 1

Computational engine: glmnet 

Lets Fit!!

texting_fit <- 
  fit(final_wf, data = analysis_train)

texting_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_zv()
• step_impute_mode()
• step_impute_mean()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────

Call:  glmnet::glmnet(x = maybe_matrix(x), y = y, family = "binomial",      alpha = ~1) 

   Df %Dev   Lambda
1   0 0.00 0.065350
2   1 0.21 0.059540
3   1 0.39 0.054250
4   1 0.54 0.049430
5   1 0.66 0.045040
6   1 0.76 0.041040
7   1 0.85 0.037390
8   2 0.93 0.034070
9   4 1.07 0.031040
10  5 1.21 0.028290
11  7 1.36 0.025770
12  9 1.53 0.023480
13 10 1.69 0.021400
14 11 1.83 0.019500
15 11 1.95 0.017760
16 12 2.06 0.016190
17 14 2.15 0.014750
18 14 2.24 0.013440
19 15 2.33 0.012240
20 17 2.41 0.011160
21 20 2.50 0.010170
22 21 2.59 0.009263
23 24 2.67 0.008440
24 24 2.75 0.007690
25 26 2.82 0.007007
26 30 2.91 0.006384
27 31 3.00 0.005817
28 32 3.08 0.005300
29 32 3.14 0.004830
30 36 3.20 0.004401
31 38 3.25 0.004010
32 39 3.30 0.003653
33 39 3.35 0.003329
34 40 3.40 0.003033
35 43 3.43 0.002764
36 43 3.47 0.002518
37 43 3.50 0.002294
38 43 3.53 0.002091
39 43 3.55 0.001905
40 44 3.57 0.001736
41 44 3.58 0.001581
42 44 3.60 0.001441
43 45 3.61 0.001313
44 45 3.62 0.001196
45 45 3.62 0.001090
46 45 3.63 0.000993

...
and 8 more lines.

Review fit on the training data

texting_pred <- 
  augment(texting_fit, analysis_train) |> 
  select(TextingDriving, .pred_class, .pred_1, .pred_0)

texting_pred
# A tibble: 8,170 × 4
   TextingDriving .pred_class .pred_1 .pred_0
   <fct>          <fct>         <dbl>   <dbl>
 1 0              1             0.510   0.490
 2 1              0             0.437   0.563
 3 0              0             0.267   0.733
 4 1              0             0.399   0.601
 5 1              1             0.505   0.495
 6 1              1             0.615   0.385
 7 1              1             0.548   0.452
 8 1              0             0.460   0.540
 9 1              1             0.519   0.481
10 1              1             0.602   0.398
# ℹ 8,160 more rows

ROC plot and AUC value

roc_plot_training <- 
  texting_pred |> 
  roc_curve(truth = TextingDriving, .pred_1, event_level = "second") |> 
  autoplot()

roc_plot_training 

texting_pred |> 
  roc_auc(truth = TextingDriving,.pred_1, event_level = "second") 
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.622

Ckeck the estimates

texting_fit |> 
  extract_fit_parsnip() |> 
  tidy()
# A tibble: 51 × 3
   term                 estimate  penalty
   <chr>                   <dbl>    <dbl>
 1 (Intercept)           0.469   0.000543
 2 DrinkingDriver       -0.0224  0.000543
 3 WeaponCarrying        0.0593  0.000543
 4 WeaponCarryingSchool  0.0666  0.000543
 5 GunCarrying           0.0995  0.000543
 6 UnsafeAtSchool        0.0439  0.000543
 7 InjuredInSchool       0.00376 0.000543
 8 PhysicalFight        -0.0571  0.000543
 9 SchoolPhysicalFight  -0.0243  0.000543
10 TimesSexualAbuse      0.0484  0.000543
# ℹ 41 more rows

Lets fit on the TEST Data

texting_last_fit <- 
  last_fit(final_wf,
           split = analysis_split,
           metrics = metric_set(accuracy, kap, roc_auc))

texting_last_fit
# Resampling results
# Manual resampling 
# A tibble: 1 × 6
  splits              id               .metrics .notes   .predictions .workflow 
  <list>              <chr>            <list>   <list>   <list>       <list>    
1 <split [8170/2724]> train/test split <tibble> <tibble> <tibble>     <workflow>

Collect the Metrics



collect_metrics(texting_last_fit)
# A tibble: 3 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.569 Preprocessor1_Model1
2 kap      binary         0.105 Preprocessor1_Model1
3 roc_auc  binary         0.599 Preprocessor1_Model1

Review Prediction on the Test Data with yardstick

the_prediction <-  
  texting_last_fit |> 
  collect_predictions() 
  

the_prediction 
# A tibble: 2,724 × 7
   id               .pred_class  .row .pred_0 .pred_1 TextingDriving .config    
   <chr>            <fct>       <int>   <dbl>   <dbl> <fct>          <chr>      
 1 train/test split 0               3   0.589   0.411 1              Preprocess…
 2 train/test split 0               6   0.527   0.473 0              Preprocess…
 3 train/test split 1               8   0.492   0.508 0              Preprocess…
 4 train/test split 1              13   0.417   0.583 1              Preprocess…
 5 train/test split 1              20   0.460   0.540 0              Preprocess…
 6 train/test split 0              24   0.545   0.455 1              Preprocess…
 7 train/test split 1              28   0.305   0.695 1              Preprocess…
 8 train/test split 1              29   0.411   0.589 0              Preprocess…
 9 train/test split 1              30   0.221   0.779 1              Preprocess…
10 train/test split 1              32   0.456   0.544 0              Preprocess…
# ℹ 2,714 more rows

Review Prediction on the Test Data with yardstick

multi_metric <- metric_set(sens, spec, accuracy, kap)

multi_metric(the_prediction, 
             truth = TextingDriving, 
             estimate = .pred_class, 
             event_level = "second")
# A tibble: 4 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 sens     binary         0.713
2 spec     binary         0.389
3 accuracy binary         0.569
4 kap      binary         0.105

Confusion Matrix on the Test Data

ROC curve on the Test Data

Variable Importance Plot

library(vip)

texting_last_fit |> 
  extract_fit_engine() |> 
  vi() |> 
  group_by(Sign) |> 
  slice_max(Importance, n = 5) |> 
  ungroup() |> 
  ggplot(aes(Importance, fct_reorder(Variable, Importance), fill = Sign)) + 
  geom_col() +
  facet_wrap(vars(Sign), scales = "free_y") +
  labs(y = NULL) +
  theme(legend.position = "none")

We did it!

Review Questions

  1. Is the Residual Sum of Squares a relevant concept for L1 and L2 penalties? why?
  1. L2 regularization would be better of if we called it the squared regularization?
  1. If I want to reduce the dimensionality of a dataset I would use the _______ penalty?
  1. If I want tidymodels to perform a ridge regression what should I change in this code?
review_spec <-
  logistic_reg(penalty = tune(), 
               mixture = tune()) |> 
  set_engine('glmnet')
  1. BONUS: stepwise vs. lasso

Thank you!