KNN using tidymodels

Catalina Canizares

Agenda

  • Understand the algorithm
  • Review of the math
  • The good and the bad
  • tidymodels example

KNN

  • K-nearest neighbors (KNN) is a supervised machine learning algorithm that can be used for both classification and regression problems.

  • The KNN algorithm assumes that similar things exist in close proximity.

  • Similar things are near to each other.

KNN

KNN uses ‘feature similarity’ to predict the value of a new data point: the point is assigned a value based on how closely it matches the points in the training set.
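
The idea can be sketched in a few lines of base R. This is a minimal illustration, not the implementation used by any package: the toy data and the helper `knn_predict()` are hypothetical, and ties are broken arbitrarily.

```r
# Toy training set: two numeric features and a class label (all values hypothetical)
train_x <- matrix(c(1, 1,
                    1, 2,
                    2, 1,
                    6, 6,
                    7, 6,
                    6, 7),
                  ncol = 2, byrow = TRUE)
train_y <- factor(c("no", "no", "no", "yes", "yes", "yes"))

# Hypothetical helper: classify one new point by majority vote of its k nearest neighbors
knn_predict <- function(new_point, train_x, train_y, k = 3) {
  # Straight-line distance from the new point to every training row
  distances <- sqrt(rowSums(sweep(train_x, 2, new_point)^2))
  # Labels of the k closest training points
  nearest_labels <- train_y[order(distances)[seq_len(k)]]
  # The most frequent label among the neighbors wins
  names(which.max(table(nearest_labels)))
}

pred_low  <- knn_predict(c(2, 2), train_x, train_y)  # "no"
pred_high <- knn_predict(c(6, 5), train_x, train_y)  # "yes"
```

The point (2, 2) sits among the "no" rows and (6, 5) among the "yes" rows, so each takes the label of its three closest neighbors.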

How does it calculate the distance?

Euclidean distance
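
For two points with the same set of numeric features, the Euclidean distance is the square root of the sum of squared differences between their feature values. A small sketch, where the helper name `euclidean_distance()` is made up for illustration:

```r
# Hypothetical helper: Euclidean distance between two numeric vectors of equal length
euclidean_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  sqrt(sum((a - b)^2))
}

point_a <- c(0, 0)
point_b <- c(3, 4)

d_manual <- euclidean_distance(point_a, point_b)
d_manual
# 5

# Base R's dist() computes the same quantity for every pair of rows
d_base <- dist(rbind(point_a, point_b), method = "euclidean")
```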

KNN

KNN can be competitive with more complex classifiers on some problems and is used in areas such as genetics, data compression, and economic forecasting.

  • In political science – classifying a potential voter as “vote Republican” or “vote Democrat”, or as “will vote” or “will not vote”.

  • Banking system – KNN can be used to predict whether a person is fit for loan approval, or whether he or she has traits similar to those of a defaulter.

  • Calculating credit ratings – KNN can help estimate an individual’s credit score by comparing that individual with people who have similar traits.

Source: A Quick Introduction to KNN Algorithm

Pros and Cons

Pros

  • It is straightforward and easy to implement; it requires only one main parameter, the value of k.

  • It makes almost no assumptions about the data. The only assumption is that nearby/similar instances belong to the same category.

  • It is a non-parametric approach

Cons

  • Inefficient for large datasets, since the distance from each new point to every training point has to be calculated.

  • KNN assumes similar data points are close to each other, so outliers and noisy points near a new observation can distort its prediction, especially when k is small.

  • It handles imbalanced data poorly: the majority class tends to dominate the neighbors’ vote.
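
One common response to class imbalance is to downsample the majority class until both classes are the same size. The base R sketch below shows the idea on a made-up outcome vector; it is an illustration under that assumption, not how any particular package implements downsampling.

```r
set.seed(2023)

# Hypothetical imbalanced outcome: 80 cases of "0" and 20 cases of "1"
outcome <- factor(rep(c("0", "1"), times = c(80, 20)))

# Size of the smallest class
minority_n <- min(table(outcome))

# Within each class, keep a random subset of minority_n rows
kept_rows <- unlist(lapply(split(seq_along(outcome), outcome), function(rows) {
  rows[sample.int(length(rows), minority_n)]
}))

balanced_counts <- table(outcome[kept_rows])
balanced_counts
```

After this step both classes contribute 20 rows, so neither dominates the neighborhood of a new point simply by being more common.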

Task

Predict whether an adolescent has been bullied, based on a set of risk behaviors.

Data Cleaning

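library(tidyverse)   # mutate(), drop_na(), select()
library(tidymodels)  # initial_split(), recipe(), workflow(), yardstick metrics
library(janitor)     # tabyl() and adorn_*() helpers

# assumes the package that provides riskyBehaviors is already loaded
data("riskyBehaviors")

riskyBehaviors_analysis <- 
  riskyBehaviors |> 
  mutate(Bullying = factor(Bullying)) |>   # outcome must be a factor for classification
  drop_na(Bullying) |>                     # remove rows with a missing outcome
  select(-c(SourceAlcohol, SourceVaping, contains("Times"), contains("Days"), CyberBullying))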

Splitting the data

set.seed(2023)

bullying_split <- initial_split(riskyBehaviors_analysis, 
                               strata = Bullying)

bullying_train <- training(bullying_split)
bullying_test <- testing(bullying_split)

bullying_split
<Training/Testing/Total>
<10085/3362/13447>
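
By default `initial_split()` keeps three quarters of the rows for training, which matches the counts above. The `strata` argument asks for the sampling to be done within each class so that both sets keep a similar class balance. A rough base R sketch of that idea, using a made-up outcome vector rather than the real data (this is not rsample's actual implementation):

```r
set.seed(2023)

# Hypothetical outcome with an 80/20 class balance
outcome <- factor(rep(c("0", "1"), times = c(800, 200)))

# Sample 75% of the row indices within each class separately
train_rows <- unlist(lapply(split(seq_along(outcome), outcome), function(rows) {
  rows[sample.int(length(rows), round(0.75 * length(rows)))]
}))

# Class proportions in the training rows and in the held-out rows
train_props <- prop.table(table(outcome[train_rows]))
test_props  <- prop.table(table(outcome[-train_rows]))
```

Both `train_props` and `test_props` come out at 80% and 20%, the same balance as the full vector.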

Let's Check Our Work

bullying_train |> 
  tabyl(Bullying)  |> 
  adorn_pct_formatting(0) |> 
  adorn_totals()
 Bullying     n percent
        0  8058     80%
        1  2027     20%
    Total 10085       -

bullying_test |>  
  tabyl(Bullying)  |> 
  adorn_pct_formatting(0) |> 
  adorn_totals()
 Bullying    n percent
        0 2686     80%
        1  676     20%
    Total 3362       -

The Recipe

library(themis)
# usemodels::use_kknn(Bullying ~ ., data = bullying_train)

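bullying_recipe <- 
  recipe(formula = Bullying ~ ., data = bullying_train) |>
  # balance the classes by downsampling the majority class (applied to training data only)
  step_downsample(Bullying, under_ratio = 1) |> 
  # drop predictors with a single unique value
  step_zv(all_predictors()) |> 
  # center and scale numeric predictors so each contributes comparably to the distance
  step_normalize(all_numeric_predictors()) |> 
  # fill in missing values: mode for nominal predictors, mean for numeric predictors
  step_impute_mode(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |> 
  # convert nominal predictors to indicator columns
  step_dummy(all_nominal_predictors())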
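
Normalizing matters for KNN because distances are dominated by whichever feature has the largest numeric range. The sketch below uses a made-up data frame (`people`, with hypothetical `age` and `income` columns) and base R's `scale()` as a stand-in for the centering and scaling that `step_normalize()` performs.

```r
# Hypothetical data: age in years and income in dollars
people <- data.frame(
  age    = c(15, 16, 45, 44),
  income = c(20000, 20500, 20100, 40000)
)

# On the raw scale income dominates: person 1 looks closer to person 3
# (30 years older) than to person 2 (1 year older)
raw_dist <- as.matrix(dist(people))
raw_dist[1, c(2, 3)]

# After centering and scaling each column, age counts too:
# person 1 is now closest to person 2
scaled_dist <- as.matrix(dist(scale(people)))
scaled_dist[1, c(2, 3)]
```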

The Specification

kknn_spec <- 
  nearest_neighbor(neighbors = 3) %>% 
  set_mode("classification") %>% 
  set_engine("kknn") 

The Workflow

kknn_workflow <- 
  workflow() %>% 
  add_recipe(bullying_recipe) %>% 
  add_model(kknn_spec) 

kknn_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps

• step_downsample()
• step_zv()
• step_normalize()
• step_impute_mode()
• step_impute_mean()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = 3

Computational engine: kknn 

Fit the model on the training set

bullying_fit <- fit(kknn_workflow, bullying_train)
bullying_fit

Checking predictions in the training set

bullying_pred <- 
  augment(bullying_fit, bullying_train) |> 
  select(Bullying, .pred_class, .pred_1, .pred_0)

bullying_pred
# A tibble: 10,085 × 4
   Bullying .pred_class .pred_1 .pred_0
   <fct>    <fct>         <dbl>   <dbl>
 1 0        0             0      1     
 2 0        0             0.244  0.756 
 3 0        0             0.244  0.756 
 4 0        0             0      1     
 5 0        1             0.756  0.244 
 6 0        1             0.690  0.310 
 7 0        0             0.244  0.756 
 8 0        0             0.310  0.690 
 9 0        0             0.244  0.756 
10 0        1             0.934  0.0656
# ℹ 10,075 more rows

Check the Performance

roc_plot <- 
  bullying_pred |> 
  roc_curve(truth = Bullying, 
           .pred_1, 
           event_level = "second") |> 
  autoplot()

roc_plot

bullying_pred |> 
  roc_auc(truth = Bullying, 
           .pred_1, 
           event_level = "second")
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.866
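
One way to read the ROC AUC: it is the probability that a randomly chosen event receives a higher predicted probability than a randomly chosen non-event, with ties counted as one half. The sketch below computes that directly on a handful of made-up predictions; the vectors `truth` and `pred_1` are hypothetical and unrelated to the bullying data.

```r
# Hypothetical true classes and predicted probabilities of the event ("1")
truth  <- factor(c("0", "0", "1", "0", "1", "1"), levels = c("0", "1"))
pred_1 <- c(0.10, 0.40, 0.35, 0.20, 0.80, 0.90)

events     <- pred_1[truth == "1"]
non_events <- pred_1[truth == "0"]

# Compare every event with every non-event:
# 1 if the event scores higher, 0.5 for a tie, 0 otherwise
pairs <- outer(events, non_events,
               FUN = function(e, n) (e > n) + 0.5 * (e == n))

# The AUC is the share of pairs that are ordered correctly
auc_manual <- mean(pairs)
auc_manual
```

Here 8 of the 9 event/non-event pairs are ordered correctly, so the AUC is 8/9.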

Last fit

bullying_last_fit <- 
  last_fit(kknn_workflow, 
           split = bullying_split, 
           metrics = metric_set(kap, roc_auc, sens, spec))

bullying_last_fit 

Metrics in Testing Data

collect_metrics(bullying_last_fit)
# A tibble: 4 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 kap     binary         0.102 Preprocessor1_Model1
2 sens    binary         0.636 Preprocessor1_Model1
3 spec    binary         0.503 Preprocessor1_Model1
4 roc_auc binary         0.607 Preprocessor1_Model1

Predictions in the testing set

predictions_testing <- 
  bullying_last_fit |> 
  collect_predictions()

predictions_testing
# A tibble: 3,362 × 7
   id               .pred_class  .row .pred_0 .pred_1 Bullying .config          
   <chr>            <fct>       <int>   <dbl>   <dbl> <fct>    <chr>            
 1 train/test split 1               6  0       1      1        Preprocessor1_Mo…
 2 train/test split 1               7  0.0656  0.934  1        Preprocessor1_Mo…
 3 train/test split 1               8  0.0656  0.934  1        Preprocessor1_Mo…
 4 train/test split 0              16  0.934   0.0656 0        Preprocessor1_Mo…
 5 train/test split 0              25  0.756   0.244  0        Preprocessor1_Mo…
 6 train/test split 0              34  0.934   0.0656 1        Preprocessor1_Mo…
 7 train/test split 0              37  0.756   0.244  0        Preprocessor1_Mo…
 8 train/test split 1              39  0.244   0.756  0        Preprocessor1_Mo…
 9 train/test split 0              40  1       0      0        Preprocessor1_Mo…
10 train/test split 0              51  0.934   0.0656 0        Preprocessor1_Mo…
# ℹ 3,352 more rows

Make sure your metrics are interpretable

multi_metric <- metric_set(sens, spec, accuracy, kap)

multi_metric(predictions_testing, 
             truth = Bullying, 
             estimate = .pred_class, 
             event_level = "second")
# A tibble: 4 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 sens     binary         0.503
2 spec     binary         0.636
3 accuracy binary         0.609
4 kap      binary         0.102
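
Note that `sens` and `spec` trade places between this table and the `collect_metrics()` output: yardstick treats the first factor level ("0") as the event unless `event_level = "second"` is given. The sketch below recomputes the four metrics by hand from made-up confusion-matrix counts (not the counts from this model) to show where each number comes from.

```r
# Hypothetical confusion-matrix counts, treating "1" as the event
tp <- 30; fn <- 20   # truly "1": predicted "1" / predicted "0"
fp <- 10; tn <- 40   # truly "0": predicted "1" / predicted "0"
n  <- tp + fn + fp + tn

# With "1" as the event
sens_event_1 <- tp / (tp + fn)   # share of true "1" cases that were caught
spec_event_1 <- tn / (tn + fp)   # share of true "0" cases that were caught
accuracy     <- (tp + tn) / n

# With "0" as the event, the same two numbers swap roles
sens_event_0 <- tn / (tn + fp)
spec_event_0 <- tp / (tp + fn)

# Cohen's kappa: agreement beyond what the row and column totals would give by chance
expected <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
kappa    <- (accuracy - expected) / (1 - expected)
```

With these counts sensitivity is 0.6 and specificity 0.8 when "1" is the event, and the reverse when "0" is the event; accuracy (0.7) and kappa (0.4) do not depend on the choice.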

Confusion Matrix in the testing set

conf_mat_test <- 
  predictions_testing |> 
  conf_mat(Bullying, .pred_class) |> 
  autoplot(type = "heatmap")

conf_mat_test

We did it!