Imbalanced Data and Threshold Selection

When one class is rare, standard accuracy can become misleading. A model may predict the majority class almost all the time and still look good on paper.
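A minimal sketch with made-up numbers shows the problem. The class ratio (95:5) and the always-majority "model" are illustrative assumptions, not data from any real task:

```python
# Hypothetical toy data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a degenerate "model" that always predicts the majority class

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(accuracy)  # 0.95 -- looks strong on paper
print(recall)    # 0.0  -- yet every positive case is missed
```

The 95% accuracy score hides the fact that the model never identifies a single positive case.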

That is why imbalanced classification often shifts attention toward:

  • precision,
  • recall,
  • F1-score,
  • ROC-AUC or PR-AUC,
  • and threshold-dependent trade-offs.
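The first three metrics follow directly from the confusion-matrix counts. The counts below (tp, fp, fn) are invented for illustration:

```python
# Hypothetical confusion-matrix counts:
tp, fp, fn = 30, 10, 20  # true positives, false positives, false negatives

precision = tp / (tp + fp)                        # of the flagged cases, how many were right
recall = tp / (tp + fn)                           # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(precision, recall, round(f1, 3))  # 0.75 0.6 0.667
```

Note that none of these formulas involve true negatives, which is exactly why they stay informative when negatives dominate the data.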

Common resampling options include:

  • undersampling the majority class,
  • oversampling the minority class,
  • SMOTE, which synthesizes new minority examples by interpolating between a minority point and its nearest minority-class neighbors, rather than duplicating old ones.

Each option has a trade-off. Undersampling may discard useful information, while naive oversampling can lead the model to overfit the duplicated examples.
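Naive oversampling is simple enough to sketch with the standard library. The 95:5 split is again an illustrative assumption, and the integers stand in for feature rows:

```python
import random

random.seed(0)
majority = [(x, 0) for x in range(95)]  # (features, label) stand-ins
minority = [(x, 1) for x in range(5)]

# Naive oversampling: draw minority rows with replacement until the classes match.
oversampled = minority + random.choices(minority, k=len(majority) - len(minority))
balanced = majority + oversampled

print(sum(1 for _, y in balanced if y == 1))  # 95 -- now equal to the majority count
```

Because the minority rows are exact duplicates, a flexible model can memorize them, which is the overfitting risk mentioned above and the motivation for SMOTE's interpolated examples.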

Sometimes it is better to keep the original data and change the learning objective instead.

Examples include:

  • class weights in logistic regression or random forests,
  • balanced random forests,
  • boosting methods that emphasize hard or minority examples,
  • ensembles designed specifically for imbalanced data.

These methods tell the model that some mistakes are more expensive than others.
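For class weights specifically, a common heuristic (the one scikit-learn's "balanced" option uses) sets each class's weight to n_samples / (n_classes * n_c). A sketch of that calculation, with the same hypothetical 95:5 labels:

```python
from collections import Counter

y = [0] * 95 + [1] * 5
counts = Counter(y)
n, k = len(y), len(counts)

# "balanced" heuristic: weight_c = n_samples / (n_classes * count_c)
weights = {c: n / (k * counts[c]) for c in counts}

print(weights)  # {0: 0.526..., 1: 10.0} -- each minority mistake costs ~19x more
```

In effect, every minority error counts as heavily in the loss as roughly nineteen majority errors, without resampling a single row.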

Many classifiers output probabilities rather than final labels. The threshold used to convert a score into a class label is therefore part of the modeling decision.

Lowering the threshold flags more cases as positive, which usually increases recall at the cost of precision; raising it has the opposite effect.
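A tiny illustration of that trade-off, with invented scores and a helper named classify for the purpose of this sketch:

```python
scores = [0.1, 0.35, 0.5, 0.8, 0.9]  # hypothetical model probabilities

def classify(scores, threshold):
    """Convert scores to class labels at a given cutoff."""
    return [int(s >= threshold) for s in scores]

print(classify(scores, 0.5))  # [0, 0, 1, 1, 1]
print(classify(scores, 0.3))  # [0, 1, 1, 1, 1] -- lower cutoff, more positives
```

The model itself is unchanged between the two calls; only the decision rule moved.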

The right threshold depends on the task:

  • screening problems often prefer high recall,
  • high-cost false alarms may require high precision,
  • balanced situations may optimize F1 or a custom utility.
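For the third case, one practical recipe is to sweep candidate thresholds and keep the one that maximizes F1. The labels, scores, and 0.05 grid step below are all illustrative assumptions:

```python
def f1_at(y_true, scores, t):
    """F1 score when classifying at threshold t."""
    pred = [int(s >= t) for s in scores]
    tp = sum(p and y for p, y in zip(pred, y_true))
    fp = sum(p and not y for p, y in zip(pred, y_true))
    fn = sum((not p) and y for p, y in zip(pred, y_true))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

y = [0, 0, 1, 0, 1, 1]                 # hypothetical labels
s = [0.2, 0.4, 0.45, 0.6, 0.7, 0.9]    # hypothetical scores

# Sweep a grid of thresholds and keep the F1-maximizing one.
best_t = max([t / 20 for t in range(1, 20)], key=lambda t: f1_at(y, s, t))
print(best_t)  # 0.45
```

To optimize a custom utility instead, replace f1_at with a function encoding the actual costs of false positives and false negatives. Either way, the threshold should be tuned on validation data, not on the test set.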

When training data is limited, augmentation can add useful diversity:

  • image flips, crops, or rotations,
  • text paraphrasing or replacement,
  • time-series windowing or noise injection,
  • synthetic minority generation such as SMOTE.

Augmentation is not a substitute for genuine data quality, but it can make models more robust when used carefully.
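As one concrete example from the list above, noise injection for time series fits in a few lines. The series, the sigma value, and the jitter helper are illustrative choices, not a recommended recipe:

```python
import random

random.seed(42)
series = [1.0, 2.0, 3.0, 4.0]  # a hypothetical short time series

def jitter(xs, sigma=0.05):
    """Return a copy of xs with small Gaussian noise added to each point."""
    return [x + random.gauss(0.0, sigma) for x in xs]

augmented = [jitter(series) for _ in range(3)]  # three noisy variants
```

Each variant preserves the overall shape of the series while perturbing individual points, which is the kind of label-preserving diversity augmentation aims for; sigma should be small enough that the class identity is never in doubt.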

In this lesson we covered:

  1. Why accuracy is often insufficient on imbalanced data
  2. The main resampling strategies
  3. Cost-sensitive and ensemble-based approaches
  4. Why threshold choice changes practical performance
  5. How data augmentation can support robustness and balance

Next: We move from supervised evaluation to unsupervised structure discovery with clustering methods.