KNN using tidymodels

Catalina Canizares


  • Understand the algorithm
  • Review of the math
  • The good and the bad
  • tidymodels example


  • K-nearest neighbors (KNN) is a supervised machine learning algorithm that can be used for both classification and regression problems.

  • The KNN algorithm assumes that similar things exist in close proximity.

  • Similar things are near to each other.


KNN uses ‘feature similarity’ to predict values for new data points: a new point is assigned a value based on how closely it resembles the points in the training set.
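
The idea above can be sketched in a few lines of base R. This is a minimal illustration, not the kknn implementation used later: knn_predict() is a hypothetical helper name, the toy data are made up, and the vote is an unweighted majority (the kknn engine can additionally weight neighbors by distance, which this sketch does not do).

```r
# Minimal KNN classifier in base R (illustrative sketch only).
# knn_predict() is a hypothetical helper introduced for this example.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  # Euclidean distance from the new point to every training row
  diffs <- sweep(train_x, 2, new_x)   # subtract new_x from each row
  dists <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training points
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Unweighted majority vote among those neighbors
  names(which.max(table(nearest)))
}

# Made-up training data: two well-separated groups in two dimensions
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1),
                 c(6, 6), c(6, 7), c(7, 6))
train_y <- factor(c("A", "A", "A", "B", "B", "B"))

knn_predict(train_x, train_y, new_x = c(1.5, 1.5), k = 3)  # "A"
knn_predict(train_x, train_y, new_x = c(6.5, 6.5), k = 3)  # "B"
```

Each new point takes the label of the group it sits closest to, which is all the "training" a KNN model does.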

How does it calculate the distance?

Euclidean distance
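
For two points p and q described by n numeric features, the standard Euclidean (straight-line) distance can be written as below. The symbols p, q, and n are notation introduced here for illustration.

```latex
d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
```

For example, with p = (1, 2) and q = (4, 6), the distance is sqrt(3^2 + 4^2) = 5.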


KNN can be competitive with more complex classifiers on some problems and is used in areas such as genetics, data compression, and economic forecasting.

  • Political science: classifying a potential voter as “vote Republican” or “vote Democrat”, or as “will vote” or “will not vote”.

  • Banking: KNN can be used to predict whether a person is fit for loan approval, or whether they have traits similar to those of a defaulter.

  • Credit ratings: KNN can help estimate an individual’s credit score by comparing them with people who have similar traits.

Pros and Cons


  • It is straightforward and easy to implement: the main hyperparameter to choose is k, the number of neighbors.

  • It makes almost no assumptions about the data. The only assumption is that nearby (similar) instances belong to the same category.

  • It is a non-parametric approach.


  • It is inefficient for large datasets, since the distance from each new point to every training point has to be calculated.

  • KNN assumes similar data points are close to each other, so the model is susceptible to outliers.

  • It handles imbalanced data poorly: the majority class tends to dominate the neighbors’ vote.
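
One common workaround for class imbalance is to downsample the majority class before fitting, which is what step_downsample(under_ratio = 1) does in the recipe later on. The base-R sketch below shows the general idea on made-up data; the seed, the toy data frame, and the variable names are assumptions for illustration, not the themis implementation.

```r
# Base-R sketch of downsampling the majority class to a 1:1 ratio.
# The data are made up and the seed is arbitrary.
set.seed(123)
toy <- data.frame(
  outcome = factor(c(rep("0", 80), rep("1", 20))),
  x = rnorm(100)
)

# Size of the smallest class
n_min <- min(table(toy$outcome))

# Sample n_min row indices within each class, then stack them
keep <- unlist(lapply(split(seq_len(nrow(toy)), toy$outcome),
                      function(idx) sample(idx, n_min)))
toy_balanced <- toy[keep, ]

# Both classes now have the same number of rows
table(toy_balanced$outcome)
```

Downsampling discards majority-class rows, so it trades some information for balanced neighborhoods.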


Predict whether an adolescent has been bullied or not based on a set of various risk behaviors.

Data Cleaning


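library(tidymodels)  # also attaches dplyr and tidyr
library(themis)      # step_downsample()
library(janitor)     # tabyl(), adorn_*()

riskyBehaviors_analysis <- 
  riskyBehaviors |> 
  mutate(Bullying = factor(Bullying)) |> 
  drop_na(Bullying) |> 
  select(-c(SourceAlcohol, SourceVaping, contains("Times"), contains("Days"), CyberBullying))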

Splitting the data


bullying_split <- initial_split(riskyBehaviors_analysis, 
                               strata = Bullying)

bullying_train <- training(bullying_split)
bullying_test <- testing(bullying_split)


Let’s Check Our Work

bullying_train |> 
  tabyl(Bullying)  |> 
  adorn_pct_formatting(0) |> 
  adorn_totals()
 Bullying     n percent
        0  8058     80%
        1  2027     20%
    Total 10085       -
bullying_test |>  
  tabyl(Bullying)  |> 
  adorn_pct_formatting(0) |> 
  adorn_totals()
 Bullying    n percent
        0 2686     80%
        1  676     20%
    Total 3362       -

The Recipe

# usemodels::use_kknn(Bullying ~ ., data = bullying_train)

bullying_recipe <- 
  recipe(formula = Bullying ~ ., data = bullying_train) |>
  step_downsample(Bullying, under_ratio = 1) |> 
  step_zv(all_predictors()) |> 
  step_normalize(all_numeric_predictors()) |> 
  step_impute_mode(all_nominal_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |> 
  # step_dummy() is listed in the workflow printout; the selector here is assumed
  step_dummy(all_nominal_predictors())
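
Because KNN relies on distances, a feature measured on a large scale can dominate the result, which is why the recipe centers and scales numeric predictors with step_normalize(). The sketch below illustrates the effect with base R's scale() on made-up values; the feature names (age, income) are hypothetical and are not columns of the bullying data.

```r
# Why scaling matters for distance-based models (made-up values).
# Two hypothetical features on very different scales.
x <- rbind(a = c(age = 15, income = 30000),
           b = c(age = 18, income = 30500),
           c = c(age = 15, income = 31000))

# Raw Euclidean distances are driven almost entirely by income:
# a looks much closer to b than to c, although a and c have the same age
d_raw <- as.matrix(dist(x))
d_raw

# After centering and scaling each column (mean 0, sd 1),
# both features contribute on a comparable footing
d_scaled <- as.matrix(dist(scale(x)))
d_scaled
```

In this toy case the raw distances rank b as the nearest neighbor of a purely because of income, while the scaled distances treat the age difference as equally important.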

The Specification

kknn_spec <- 
  nearest_neighbor(neighbors = 3) %>% 
  set_mode("classification") %>% 
  set_engine("kknn")

The Workflow

kknn_workflow <- 
  workflow() %>% 
  add_recipe(bullying_recipe) %>% 
  add_model(kknn_spec)

kknn_workflow

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps

• step_downsample()
• step_zv()
• step_normalize()
• step_impute_mode()
• step_impute_mean()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = 3

Computational engine: kknn 

Fit the model in the training set

bullying_fit <- fit(kknn_workflow, bullying_train)

Checking predictions in the training set

bullying_pred <- 
  augment(bullying_fit, bullying_train) |> 
  select(Bullying, .pred_class, .pred_1, .pred_0)

# A tibble: 10,085 × 4
   Bullying .pred_class .pred_1 .pred_0
   <fct>    <fct>         <dbl>   <dbl>
 1 0        0             0      1     
 2 0        0             0.244  0.756 
 3 0        0             0.244  0.756 
 4 0        0             0      1     
 5 0        1             0.756  0.244 
 6 0        1             0.690  0.310 
 7 0        0             0.244  0.756 
 8 0        0             0.310  0.690 
 9 0        0             0.244  0.756 
10 0        1             0.934  0.0656
# ℹ 10,075 more rows

Check the Performance

roc_plot <- 
  bullying_pred |> 
  roc_curve(truth = Bullying, 
           .pred_1, 
           event_level = "second") |> 
  autoplot()  # plot the ROC curve

roc_plot

bullying_pred |> 
  roc_auc(truth = Bullying, 
           .pred_1, 
           event_level = "second")
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.866
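
One way to read this number: the AUC is the probability that a randomly chosen positive case (here, Bullying = 1) receives a higher predicted probability than a randomly chosen negative case. The base-R sketch below computes that directly on made-up scores; auc_by_hand() is a hypothetical helper, not part of yardstick. Since the value above comes from the training set, which the model has already seen, it is likely to be optimistic compared with the testing set.

```r
# AUC as the probability that a randomly chosen positive case gets a higher
# score than a randomly chosen negative case (ties count as one half).
# auc_by_hand() is a hypothetical helper; the data are made up.
auc_by_hand <- function(truth, score, event = "1") {
  pos <- score[truth == event]
  neg <- score[truth != event]
  # Compare every positive score with every negative score
  cmp <- outer(pos, neg, FUN = function(p, n) (p > n) + 0.5 * (p == n))
  mean(cmp)
}

truth <- factor(c("0", "0", "0", "1", "1"))
score <- c(0.1, 0.4, 0.8, 0.6, 0.9)  # predicted probability of class "1"

# 5 of the 6 positive-negative pairs are ordered correctly
auc_by_hand(truth, score)
```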

Last fit

bullying_last_fit <- 
  last_fit(kknn_workflow, 
           split = bullying_split, 
           metrics = metric_set(kap, roc_auc, sens, spec))


Metrics in Testing Data

bullying_last_fit |> 
  collect_metrics()
# A tibble: 4 × 4
  .metric .estimator .estimate .config             
  <chr>   <chr>          <dbl> <chr>               
1 kap     binary         0.102 Preprocessor1_Model1
2 sens    binary         0.636 Preprocessor1_Model1
3 spec    binary         0.503 Preprocessor1_Model1
4 roc_auc binary         0.607 Preprocessor1_Model1

Predictions in the testing set

predictions_testing <- 
  bullying_last_fit |> 
  collect_predictions()

# A tibble: 3,362 × 7
   id               .pred_class  .row .pred_0 .pred_1 Bullying .config          
   <chr>            <fct>       <int>   <dbl>   <dbl> <fct>    <chr>            
 1 train/test split 1               6  0       1      1        Preprocessor1_Mo…
 2 train/test split 1               7  0.0656  0.934  1        Preprocessor1_Mo…
 3 train/test split 1               8  0.0656  0.934  1        Preprocessor1_Mo…
 4 train/test split 0              16  0.934   0.0656 0        Preprocessor1_Mo…
 5 train/test split 0              25  0.756   0.244  0        Preprocessor1_Mo…
 6 train/test split 0              34  0.934   0.0656 1        Preprocessor1_Mo…
 7 train/test split 0              37  0.756   0.244  0        Preprocessor1_Mo…
 8 train/test split 1              39  0.244   0.756  0        Preprocessor1_Mo…
 9 train/test split 0              40  1       0      0        Preprocessor1_Mo…
10 train/test split 0              51  0.934   0.0656 0        Preprocessor1_Mo…
# ℹ 3,352 more rows

Make sure your metrics are interpretable

multi_metric <- metric_set(sens, spec, accuracy, kap)

multi_metric(predictions_testing, 
             truth = Bullying, 
             estimate = .pred_class, 
             event_level = "second")
# A tibble: 4 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 sens     binary         0.503
2 spec     binary         0.636
3 accuracy binary         0.609
4 kap      binary         0.102
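
Note that sens and spec here are swapped relative to the earlier testing metrics, which were computed with yardstick's default of treating the first factor level ("0") as the event. The base-R sketch below shows why the two values trade places; the labels are made up and recall_for() is a hypothetical helper introduced for illustration.

```r
# Sensitivity and specificity depend on which class is treated as the "event".
# Labels below are made up for illustration.
truth    <- factor(c(rep("0", 8), rep("1", 4)), levels = c("0", "1"))
estimate <- factor(c(rep("0", 6), rep("1", 2),   # 6 of 8 true "0" predicted "0"
                     rep("1", 2), rep("0", 2)),  # 2 of 4 true "1" predicted "1"
                   levels = c("0", "1"))

# recall_for() is a hypothetical helper:
# share of true `event` cases that were predicted as `event`
recall_for <- function(truth, estimate, event) {
  mean(estimate[truth == event] == event)
}

# Event = "1" (second level): sens is recall for "1", spec is recall for "0"
c(sens = recall_for(truth, estimate, "1"),
  spec = recall_for(truth, estimate, "0"))

# Event = "0" (first level): the same two numbers swap roles
c(sens = recall_for(truth, estimate, "0"),
  spec = recall_for(truth, estimate, "1"))
```

Accuracy and kappa do not depend on the event level, which is why kap is the same in both tables.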

Confusion Matrix in the testing set

conf_mat_test <- 
predictions_testing |> 
  conf_mat(Bullying, .pred_class) |> 
  autoplot(type = "heatmap")

We did it!