tidymodels
tidymodels example

The K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm that can be used for both classification and regression problems.
The KNN algorithm assumes that similar things exist in close proximity: similar things are near to each other.
It uses ‘feature similarity’ to predict the value of a new data point: the new point is assigned a value based on how closely it matches points in the training set.
Distances between points are typically measured with the Euclidean distance.
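For two points $x$ and $y$ with $p$ features, the Euclidean distance between them is

$$
d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}
$$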
KNN can outperform more complex classifiers in some settings and is used in areas such as genetics, data compression, and economic forecasting.
In political science – classifying a voter as “vote Republican” or “vote Democrat”, or as “will vote” or “will not vote”.
Banking – KNN can be used to predict whether a person is a good candidate for loan approval, or whether they share traits with known defaulters.
Calculating credit ratings – KNN can help calculate an individual’s credit score by comparing them with people who have similar traits.
::: footer
A Quick Introduction to KNN Algorithm
:::
Pros
It is straightforward and easy to implement; it requires only the parameter k (the number of neighbors).
There are almost no assumptions about the data; the only assumption is that nearby/similar instances belong to the same category.
It is a non-parametric approach.
Cons
Inefficient for large datasets, since the distance to every training point has to be calculated for each prediction.
KNN assumes similar data points are close to each other. Therefore, the model is susceptible to outliers.
It handles imbalanced data poorly: the majority class tends to dominate the neighbor vote.
Predict whether an adolescent has been bullied based on a set of risk behaviors.
| Training | Testing | Total  |
|----------|---------|--------|
| 10,085   | 3,362   | 13,447 |
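A minimal sketch of how such a split can be produced with rsample; the data object name `bullying_data`, the seed, and stratifying on the outcome are assumptions (13,447 × 0.75 ≈ 10,085 training rows):

```r
library(tidymodels)

set.seed(1234)  # assumed seed, for reproducibility

# Assumed object name: bullying_data holds all 13,447 rows
bullying_split <- initial_split(bullying_data, prop = 0.75, strata = Bullying)
bullying_train <- training(bullying_split)
bullying_test  <- testing(bullying_split)
```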
```r
library(themis)  # provides step_downsample() for class-imbalance handling

# Scaffolded with: usemodels::use_kknn(Bullying ~ ., data = bullying_train)
bullying_recipe <-
  recipe(formula = Bullying ~ ., data = bullying_train) |>
  step_downsample(Bullying, under_ratio = 1) |>   # balance classes (applied to training data only)
  step_zv(all_predictors()) |>                    # drop zero-variance predictors
  step_normalize(all_numeric_predictors()) |>     # center and scale numeric predictors
  step_impute_mode(all_nominal_predictors()) |>   # impute missing factors with the mode
  step_impute_mean(all_numeric_predictors()) |>   # impute missing numerics with the mean
  step_dummy(all_nominal_predictors())            # dummy-encode factor predictors
```
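The recipe is then paired with a 3-nearest-neighbor kknn model in a workflow. A minimal sketch that reproduces the printout below; the object names `knn_spec` and `bullying_workflow` are assumptions:

```r
# 3-nearest-neighbor classification model using the kknn engine
knn_spec <-
  nearest_neighbor(neighbors = 3) |>
  set_mode("classification") |>
  set_engine("kknn")

# Bundle preprocessing and model into one workflow
bullying_workflow <-
  workflow() |>
  add_recipe(bullying_recipe) |>
  add_model(knn_spec)

bullying_workflow
```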
```
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
6 Recipe Steps

• step_downsample()
• step_zv()
• step_normalize()
• step_impute_mode()
• step_impute_mean()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = 3

Computational engine: kknn
```
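Fitting the workflow on the training data and predicting on those same rows yields the class predictions and probabilities below; a sketch, with `bullying_fit` as an assumed name:

```r
bullying_fit <- fit(bullying_workflow, data = bullying_train)

# Class prediction plus class probabilities for every training row
augment(bullying_fit, new_data = bullying_train) |>
  select(Bullying, .pred_class, .pred_1, .pred_0)
```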
```
# A tibble: 10,085 × 4
   Bullying .pred_class .pred_1 .pred_0
   <fct>    <fct>         <dbl>   <dbl>
 1 0        0             0      1
 2 0        0             0.244  0.756
 3 0        0             0.244  0.756
 4 0        0             0      1
 5 0        1             0.756  0.244
 6 0        1             0.690  0.310
 7 0        0             0.244  0.756
 8 0        0             0.310  0.690
 9 0        0             0.244  0.756
10 0        1             0.934  0.0656
# ℹ 10,075 more rows
```
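The `.config` column in the next table suggests the test-set metrics come from a tune-style fit. A sketch using `last_fit()` with a custom metric set; `knn_last_fit` (and `bullying_split` from the earlier sketch) are assumed names:

```r
# Fit on the training set, evaluate once on the held-out test set
knn_last_fit <-
  last_fit(
    bullying_workflow,
    bullying_split,
    metrics = metric_set(kap, sens, spec, roc_auc)
  )

collect_metrics(knn_last_fit)
```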
```
# A tibble: 4 × 4
  .metric .estimator .estimate .config
  <chr>   <chr>          <dbl> <chr>
1 kap     binary         0.102 Preprocessor1_Model1
2 sens    binary         0.636 Preprocessor1_Model1
3 spec    binary         0.503 Preprocessor1_Model1
4 roc_auc binary         0.607 Preprocessor1_Model1
```
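The row-level test-set predictions shown next can be retrieved from the same object:

```r
collect_predictions(knn_last_fit)
```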
```
# A tibble: 3,362 × 7
   id               .pred_class  .row .pred_0 .pred_1 Bullying .config
   <chr>            <fct>       <int>   <dbl>   <dbl> <fct>    <chr>
 1 train/test split 1               6  0       1      1        Preprocessor1_Mo…
 2 train/test split 1               7  0.0656  0.934  1        Preprocessor1_Mo…
 3 train/test split 1               8  0.0656  0.934  1        Preprocessor1_Mo…
 4 train/test split 0              16  0.934   0.0656 0        Preprocessor1_Mo…
 5 train/test split 0              25  0.756   0.244  0        Preprocessor1_Mo…
 6 train/test split 0              34  0.934   0.0656 1        Preprocessor1_Mo…
 7 train/test split 0              37  0.756   0.244  0        Preprocessor1_Mo…
 8 train/test split 1              39  0.244   0.756  0        Preprocessor1_Mo…
 9 train/test split 0              40  1       0      0        Preprocessor1_Mo…
10 train/test split 0              51  0.934   0.0656 0        Preprocessor1_Mo…
# ℹ 3,352 more rows
```
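Finally, a metric table like the one below can be computed directly from the collected predictions. A sketch; note that sens and spec are swapped relative to the earlier table, which is consistent with a different event level (which factor level counts as the positive class) being used, though that reading is an assumption:

```r
knn_test_metrics <- metric_set(sens, spec, accuracy, kap)

collect_predictions(knn_last_fit) |>
  knn_test_metrics(truth = Bullying, estimate = .pred_class)
```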
```
# A tibble: 4 × 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 sens     binary         0.503
2 spec     binary         0.636
3 accuracy binary         0.609
4 kap      binary         0.102
```