Cross-Validation Strategies

A single train/test split can be noisy, especially when the dataset is small. The result may depend too much on which observations happened to land in the test set.

Cross-validation reduces that instability by repeating the training/validation cycle across several partitions of the data.

In k-fold cross-validation, we divide the data into \(k\) folds:

  1. train on \(k-1\) folds,
  2. validate on the remaining fold,
  3. repeat until every fold has been used once for validation.
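The three steps above can be sketched directly with NumPy (an assumed choice; the lesson itself is library-agnostic). The helper `kfold_indices` below is a hypothetical name for illustration: it shuffles the row indices and splits them into \(k\) folds, and the loop shows which indices would go to training and validation on each iteration.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Shuffle row indices and split them into k (nearly) equal folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    return np.array_split(indices, k)

folds = kfold_indices(n_samples=20, k=5)
for i, val_idx in enumerate(folds):
    # Step 1-2: train on the other k-1 folds, validate on fold i.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # A real run would do: model.fit(X[train_idx], y[train_idx])
    # and then compute M_i on X[val_idx], y[val_idx].
```

Every observation lands in exactly one validation fold, which is what guarantees that the whole dataset is used for validation once by the end of the loop.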

The average performance is

\[ \mathrm{CV} = \frac{1}{k}\sum_{i=1}^{k} M_i, \]

where \(M_i\) is the metric computed on fold \(i\).
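In practice this averaging is usually delegated to a library. A minimal sketch using scikit-learn (an assumed toolkit, with synthetic data from `make_classification`): `cross_val_score` returns the per-fold metrics \(M_i\), and their mean is the CV estimate from the formula above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# scores holds M_1, ..., M_k (accuracy on each validation fold)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# CV = (1/k) * sum of the M_i
cv_estimate = scores.mean()
```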

Typical choices are \(k=5\) or \(k=10\); in practice they strike a good balance between bias, variance, and computational cost.

Leave-one-out cross-validation (LOOCV) is the extreme case where \(k=n\). Each iteration holds out a single observation for validation and trains on the remaining \(n-1\).

It has two main properties:

  • it uses almost all available data for training each time,
  • but it requires \(n\) model fits, and its estimate can be high-variance because the \(n\) training sets overlap almost completely.

LOOCV is most useful when the dataset is small and the model is cheap to fit.
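A small illustration of that trade-off, again assuming scikit-learn: with the first 50 rows of the built-in diabetes dataset, LOOCV fits 50 linear regressions, one per held-out observation.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
X, y = X[:50], y[:50]  # small dataset: LOOCV means n = 50 model fits

scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
loocv_mse = -scores.mean()  # each score is the (negated) squared error of one point
```

With a cheap model like linear regression the 50 fits are fast; with an expensive model the same loop at \(k=n\) quickly becomes impractical.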

For imbalanced classification, standard k-fold can produce folds with unstable class proportions. Stratified k-fold keeps the class balance of each fold close to the full dataset.

That makes evaluation more reliable when one class is rare.
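A sketch of stratification in action, using scikit-learn's `StratifiedKFold` on a hypothetical imbalanced label vector (90 negatives, 10 positives). Because the class counts divide evenly here, every validation fold keeps exactly the 10% positive rate of the full dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced data: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positive_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]
# each validation fold keeps the dataset's 10% positive rate
```

With plain `KFold` on the same data, a shuffled fold of 20 samples could easily contain zero or four positives, which would make metrics like recall swing wildly between folds.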

Time-dependent data requires a different strategy. We must preserve order:

  • train on the past,
  • validate on a later period,
  • then roll or expand the window forward.

This avoids one of the most serious forms of leakage: using future information to predict the past.
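scikit-learn's `TimeSeriesSplit` (an assumed tool; any order-preserving splitter works) implements the expanding-window version of this idea: the training window always ends before the validation window begins.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # observations already in time order

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, val_idx in splits:
    # every training index precedes every validation index: no future leakage
    assert train_idx.max() < val_idx.min()
```

Each successive split extends the training window further into the past-plus-present while validating on the next, strictly later, block.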

Repeated k-fold runs the whole k-fold process several times with different random splits. It reduces the variance of the estimate and is often useful for smaller datasets.
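In scikit-learn (assumed here) this is a one-line change: swap the integer `cv` for a `RepeatedKFold` object. With 5 folds and 3 repeats the model is fit 15 times, and the larger pool of scores gives a steadier average.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)  # 5 x 3 = 15 fits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
cv_estimate = scores.mean()  # averaged over all 15 fold scores
```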

Other practical reminders:

  • keep a final external test set if you use CV for model selection,
  • perform preprocessing inside each training fold,
  • use parallel computation when models are expensive,
  • and remember that some ensemble methods already provide internal estimates such as out-of-bag error.
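The second reminder, preprocessing inside each training fold, is worth a sketch of its own. Assuming scikit-learn, wrapping the scaler and the model in a `Pipeline` ensures the scaler's statistics are computed from each training fold only, never from the validation fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Inside each CV iteration, StandardScaler is fit on the training fold
# alone, so no validation-fold statistics leak into training.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
```

Scaling the full dataset once before splitting would let the validation folds influence the training-fold means and variances, a subtle but real form of leakage.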

In this lesson we covered:

  1. Why a single split can be unreliable
  2. How k-fold cross-validation averages performance across folds
  3. When LOOCV is useful and when it is too expensive
  4. Why stratification matters for imbalanced classification
  5. Why time-series validation must respect chronology
  6. How repeated k-fold improves stability

Next: With the evaluation strategy in place, we can decide which features deserve to stay in the model.