Imbalanced Data and Threshold Selection

When one class is rare, standard accuracy can become misleading. A model may predict the majority class almost all the time and still look good on paper.
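A minimal sketch with made-up numbers shows the problem. The class ratio (95:5) and the always-majority "model" are illustrative assumptions, not data from any real task:

```python
# Hypothetical toy data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a degenerate "model" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)  # 0.95 -- looks strong on paper
print(recall)    # 0.0  -- yet every positive case is missed
```

The 95% accuracy score hides the fact that the model never identifies a single positive case.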

That is why imbalanced classification often shifts attention toward:

  • precision,
  • recall,
  • F1-score,
  • ROC-AUC or PR-AUC,
  • and threshold-dependent trade-offs.
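The first three metrics follow directly from the confusion-matrix counts. The counts below (tp, fp, fn) are invented for illustration:

```python
# Hypothetical confusion-matrix counts:
tp, fp, fn = 30, 10, 20  # true positives, false positives, false negatives

precision = tp / (tp + fp)                        # of the flagged cases, how many were right
recall = tp / (tp + fn)                           # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, round(f1, 3))  # 0.75 0.6 0.667
```

Note that none of these formulas involve true negatives, which is exactly why they stay informative when negatives dominate the data.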

Common resampling options include:

  • undersampling the majority class,
  • oversampling the minority class,
  • SMOTE, which synthesizes new minority examples by interpolating between a minority point and its nearest minority-class neighbors, rather than duplicating old ones.

Each option has a trade-off. Undersampling may discard useful information, while naive oversampling can lead the model to overfit the duplicated examples.
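Naive oversampling is simple enough to sketch with the standard library. The 95:5 split is again an illustrative assumption, and the integers stand in for feature rows:

```python
import random

random.seed(0)
majority = [(x, 0) for x in range(95)]  # (features, label) stand-ins
minority = [(x, 1) for x in range(5)]

# Naive oversampling: draw minority rows with replacement until the classes match.
oversampled = minority + random.choices(minority, k=len(majority) - len(minority))
balanced = majority + oversampled

print(sum(1 for _, y in balanced if y == 1))  # 95 -- now equal to the majority count
```

Because the minority rows are exact duplicates, a flexible model can memorize them, which is the overfitting risk mentioned above and the motivation for SMOTE's interpolated examples.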

Sometimes it is better to keep the original data and change the learning objective instead.

Examples include:

  • class weights in logistic regression or random forests,
  • balanced random forests,
  • boosting methods that emphasize hard or minority examples,
  • ensembles designed specifically for imbalanced data.

These methods tell the model that some mistakes are more expensive than others.
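For class weights specifically, a common heuristic (the one scikit-learn's "balanced" option uses) sets each class's weight to n_samples / (n_classes * n_c). A sketch of that calculation, with the same hypothetical 95:5 labels:

```python
from collections import Counter

y = [0] * 95 + [1] * 5
counts = Counter(y)
n, k = len(y), len(counts)

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
weights = {c: n / (k * counts[c]) for c in counts}

print(weights)  # {0: 0.526..., 1: 10.0} -- each minority mistake costs ~19x more
```

In effect, every minority error counts as heavily in the loss as roughly nineteen majority errors, without resampling a single row.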

Many classifiers output probabilities rather than final labels. The threshold used to convert a score into a class label is therefore part of the modeling decision.

Lowering the threshold flags more cases as positive, which usually increases recall at the cost of precision; raising it has the opposite effect.
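A tiny illustration of that trade-off, with invented scores and a helper named classify for the purpose of this sketch:

```python
scores = [0.1, 0.35, 0.5, 0.8, 0.9]  # hypothetical model probabilities

def classify(scores, threshold):
    """Convert scores to class labels at a given cutoff."""
    return [int(s >= threshold) for s in scores]

print(classify(scores, 0.5))  # [0, 0, 1, 1, 1]
print(classify(scores, 0.3))  # [0, 1, 1, 1, 1] -- lower cutoff, more positives
```

The model itself is unchanged between the two calls; only the decision rule moved.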

The right threshold depends on the task:

  • screening problems often prefer high recall,
  • high-cost false alarms may require high precision,
  • balanced situations may optimize F1 or a custom utility.
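For the third case, one practical recipe is to sweep candidate thresholds and keep the one that maximizes F1. The labels, scores, and 0.05 grid step below are all illustrative assumptions:

```python
def f1_at(y_true, scores, t):
    """F1 score when classifying at threshold t."""
    pred = [int(s >= t) for s in scores]
    tp = sum(p and y for p, y in zip(pred, y_true))
    fp = sum(p and not y for p, y in zip(pred, y_true))
    fn = sum((not p) and y for p, y in zip(pred, y_true))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

y = [0, 0, 1, 0, 1, 1]                 # hypothetical labels
s = [0.2, 0.4, 0.45, 0.6, 0.7, 0.9]    # hypothetical scores

# Sweep a grid of thresholds and keep the F1-maximizing one.
best_t = max([t / 20 for t in range(1, 20)], key=lambda t: f1_at(y, s, t))
print(best_t)  # 0.45
```

To optimize a custom utility instead, replace f1_at with a function encoding the actual costs of false positives and false negatives. Either way, the threshold should be tuned on validation data, not on the test set.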

When training data is limited, augmentation can add useful diversity:

  • image flips, crops, or rotations,
  • text paraphrasing or replacement,
  • time-series windowing or noise injection,
  • synthetic minority generation such as SMOTE.

Augmentation is not a substitute for genuine data quality, but it can make models more robust when used carefully.
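As one concrete example from the list above, noise injection for time series fits in a few lines. The series, the sigma value, and the jitter helper are illustrative choices, not a recommended recipe:

```python
import random

random.seed(42)
series = [1.0, 2.0, 3.0, 4.0]  # a hypothetical short time series

def jitter(xs, sigma=0.05):
    """Return a copy of xs with small Gaussian noise added to each point."""
    return [x + random.gauss(0.0, sigma) for x in xs]

augmented = [jitter(series) for _ in range(3)]  # three noisy variants
```

Each variant preserves the overall shape of the series while perturbing individual points, which is the kind of label-preserving diversity augmentation aims for; sigma should be small enough that the class identity is never in doubt.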

In this lesson we covered:

  1. Why accuracy is often insufficient on imbalanced data
  2. The main resampling strategies
  3. Cost-sensitive and ensemble-based approaches
  4. Why threshold choice changes practical performance
  5. How data augmentation can support robustness and balance

Next: We move from supervised evaluation to unsupervised structure discovery with clustering methods.