Catalina Canizares
To explore a specific hypothesis, statistical tests are used. An inferential model starts with a predefined conjecture or idea about a population and produces a statistical conclusion such as an interval estimate or the rejection of a hypothesis.
For example, the goal of a clinical trial might be to confirm that a new therapy prolongs life better than an existing one.
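A minimal sketch of that workflow in Python, assuming simulated survival times for two trial arms (the sample sizes, means, and the choice of a two-sample t-test are illustrative, not from the source):

```python
import numpy as np
from scipy import stats

# Hypothetical example: simulated survival times (months) for two arms of a trial
rng = np.random.default_rng(42)
control = rng.normal(loc=24, scale=6, size=100)    # standard therapy
treatment = rng.normal(loc=27, scale=6, size=100)  # new therapy

# Two-sample t-test of the predefined hypothesis that the mean survival differs
t_stat, p_value = stats.ttest_ind(treatment, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# Interval estimate: 95% CI for the difference in means (normal approximation)
diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
print(f"Mean difference = {diff:.2f}, 95% CI = ({diff - 1.96 * se:.2f}, {diff + 1.96 * se:.2f})")
```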
Artificial intelligence is the name of an entire field of knowledge.
Machine learning is a part of artificial intelligence.
Neural networks are one type of machine learning.
Unsupervised learning uses algorithms to analyze and cluster unlabeled datasets (a short code sketch follows this list):
Clustering: groups unlabeled data based on their similarities or differences
Dimensionality reduction: Principal component analysis
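A minimal sketch, assuming scikit-learn and using the iris features as stand-in unlabeled data (an illustrative choice), of k-means clustering and PCA:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Unlabeled data: we ignore the iris labels and keep only the features
X = load_iris().data

# Clustering: group observations by similarity (no outcome variable involved)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.labels_[:10])            # cluster assignment for the first 10 rows

# Dimensionality reduction: principal component analysis
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance explained by each component
```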
Supervised learning uses labeled datasets to train algorithms that classify data or predict outcomes accurately.
There is a “y” or outcome variable.
Popular algorithms include linear regression (sketched in code below):
\[{y} = \alpha + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon\]
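A minimal sketch, assuming scikit-learn and simulated data, of estimating the intercept \(\alpha\) and the coefficients \(\beta\) of this linear model:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated labeled data: two predictors and a known linear outcome plus noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # x1, x2
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=200)

# Fit y = alpha + beta1*x1 + beta2*x2 + epsilon
model = LinearRegression().fit(X, y)
print("alpha (intercept):", model.intercept_)
print("betas (coefficients):", model.coef_)
```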
In machine learning, we care about prediction. The focus is on estimating \(\hat{y}\):
\[\hat{y} = \hat{f}(x_1, x_2, \dots, x_n)\]
\(\hat{y}\) represents the resulting prediction for \(Y\).
\(\hat{f}\) represents the estimate for \(f\), which is often treated as a black box: no one is concerned with the exact form of \(\hat{f}\), provided that it yields accurate predictions for \(Y\) (Introduction to Statistical Learning).
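To illustrate the black-box idea (the choice of a random forest as \(\hat{f}\) is an assumption for this sketch, not the source's), any fitted model can simply be queried for its predictions \(\hat{y}\):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Any fitted model plays the role of f-hat; here a random forest (illustrative choice)
X, y = make_regression(n_samples=300, n_features=4, noise=10.0, random_state=1)
f_hat = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)

# We treat f_hat as a black box: all we care about is that its predictions are accurate
y_hat = f_hat.predict(X[:5])
print(y_hat)
```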
“When we discuss prediction models, prediction errors can be decomposed into two main subcomponents we care about: error due to ‘bias’ and error due to ‘variance’.” (Fortmann-Roe, 2012)
“There is a tradeoff between a model’s ability to minimize bias and variance.” (Fortmann-Roe, 2012)
The error due to bias is taken as the difference between the expected (or average) prediction of our model and the correct value which we are trying to predict.
The error due to variance is taken as the variability of a model prediction for a given data point.
- The variance is how much the predictions for a given point vary between different realizations of the model.
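A rough simulation sketch of these two components (the data-generating function, the decision tree, and all settings are illustrative): refit the same model on many simulated training sets and measure the squared bias and the variance of its predictions at a single test point.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
true_f = lambda x: np.sin(3 * x)      # assumed "true" function for the simulation
x0, true_y0 = 0.5, true_f(0.5)        # fixed test point and its correct value

predictions = []
for _ in range(200):
    # A fresh realization of the training data
    X = rng.uniform(0, 1, size=(50, 1))
    y = true_f(X.ravel()) + rng.normal(scale=0.3, size=50)
    # A flexible model: typically lower bias, higher variance
    model = DecisionTreeRegressor(max_depth=10).fit(X, y)
    predictions.append(model.predict([[x0]])[0])

predictions = np.array(predictions)
bias_sq = (predictions.mean() - true_y0) ** 2  # error due to bias
variance = predictions.var()                   # error due to variance
print(f"bias^2 = {bias_sq:.4f}, variance = {variance:.4f}")
```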
For machine learning, we split data into training and test sets:
The training set is used to estimate model parameters.
The test set is used to obtain an independent assessment of model performance.
🚫 CAUTION: Do not use the test set during training.
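A minimal sketch of the split with scikit-learn (the diabetes data and the random forest are illustrative choices):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# Hold out a test set; the model never sees it during training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)    # training set only
print("Test MSE:", mean_squared_error(y_test, model.predict(X_test)))  # independent assessment
```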
Resampling methods are a tool that consists of repeatedly drawing samples from a dataset and calculating statistics and metrics on each of those samples.
K-fold cross-validation involves randomly dividing the set of observations into k folds of nearly equal size. The first fold is treated as a validation set and the model is fit on the remaining k − 1 folds; the process is repeated k times, with a different fold serving as the validation set each time.
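For example, 5-fold cross-validation with scikit-learn (the linear model and diabetes data are illustrative):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# 5-fold cross-validation: each fold serves once as the validation set
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="neg_mean_squared_error")
print("MSE per fold:", -scores)
print("Mean MSE:", -scores.mean())
```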
Leave-one-out cross-validation (LOOCV): only one observation is used for validation and the rest are used to fit the model.
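The same idea with leave-one-out, where the number of folds equals the number of observations (again an illustrative sketch):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Leave-one-out: n fits, each validated on a single held-out observation
scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
print("LOOCV MSE:", -scores.mean())
```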
| Method | Hyperparameter | Description |
|---|---|---|
| Lasso | lambda | Regularization strength |
| KNN | n_neighbors | Number of neighbors to consider |
| KNN | weights | Weight function used in prediction: “uniform” or “distance” |
| Trees | max_depth | Maximum depth of the tree |
| Trees | min_samples_split | Minimum number of samples required to split an internal node |
| Trees | min_samples_leaf | Minimum number of samples required to be at a leaf node |
| Trees | max_features | Number of features to consider when looking for the best split |
| Random Forest | n_estimators | Number of decision trees in the forest |
| Random Forest | max_depth | Maximum depth of the decision trees |
| Random Forest | min_samples_split | Minimum number of samples required to split an internal node |
| Random Forest | min_samples_leaf | Minimum number of samples required to be at a leaf node |
| Random Forest | max_features | Number of features to consider when looking for the best split |
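A hedged sketch of tuning a few of the random forest hyperparameters listed above with a cross-validated grid search (the grid values and the data are illustrative, not recommendations):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

# Illustrative grid over some of the random forest hyperparameters in the table
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 6, None],
    "min_samples_split": [2, 10],
    "min_samples_leaf": [1, 5],
}

# Each combination is scored with 5-fold cross-validation on the training data
search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                      cv=5, scoring="neg_mean_squared_error")
search.fit(X, y)
print("Best hyperparameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)
```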