Cross-Validation Strategies

A single train/test split can be noisy, especially when the dataset is small. The result may depend too much on which observations happened to land in the test set.

Cross-validation reduces that instability by repeating the training/validation cycle across several partitions of the data.

In k-fold cross-validation, we divide the data into \(k\) folds:

  1. train on \(k-1\) folds,
  2. validate on the remaining fold,
  3. repeat until every fold has been used once for validation.
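The three steps above can be sketched directly with NumPy (an assumed choice; the lesson itself is library-agnostic). The helper `kfold_indices` below is a hypothetical name for illustration: it shuffles the row indices and splits them into \(k\) folds, and the loop shows which indices would go to training and validation on each iteration.

```python
import numpy as np

def kfold_indices(n_samples, k, seed=0):
    """Shuffle row indices and split them into k (nearly) equal folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    return np.array_split(indices, k)

folds = kfold_indices(n_samples=20, k=5)
for i, val_idx in enumerate(folds):
    # Step 1-2: train on the other k-1 folds, validate on fold i.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # A real run would do: model.fit(X[train_idx], y[train_idx])
    # and then compute M_i on X[val_idx], y[val_idx].
```

Every observation lands in exactly one validation fold, which is what guarantees that the whole dataset is used for validation once by the end of the loop.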

The average performance is

\[ \mathrm{CV} = \frac{1}{k}\sum_{i=1}^{k} M_i, \]

where \(M_i\) is the metric computed on fold \(i\).
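In practice this averaging is usually delegated to a library. A minimal sketch using scikit-learn (an assumed toolkit, with synthetic data from `make_classification`): `cross_val_score` returns the per-fold metrics \(M_i\), and their mean is the CV estimate from the formula above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# scores holds M_1, ..., M_k (accuracy on each validation fold)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

# CV = (1/k) * sum of the M_i
cv_estimate = scores.mean()
```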

Typical choices are \(k=5\) or \(k=10\); in practice they strike a good balance between bias, variance, and computational cost.

Leave-one-out cross-validation (LOOCV) is the extreme case where \(k=n\). Each iteration holds out a single observation for validation and trains on the remaining \(n-1\).

It has two main properties:

  • it uses almost all available data for training each time,
  • but it requires \(n\) model fits, and its estimate can be high-variance because the \(n\) training sets overlap almost completely.

LOOCV is most useful when the dataset is small and the model is cheap to fit.
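A small illustration of that trade-off, again assuming scikit-learn: with the first 50 rows of the built-in diabetes dataset, LOOCV fits 50 linear regressions, one per held-out observation.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
X, y = X[:50], y[:50]  # small dataset: LOOCV means n = 50 model fits

scores = cross_val_score(LinearRegression(), X, y, cv=LeaveOneOut(),
                         scoring="neg_mean_squared_error")
loocv_mse = -scores.mean()  # each score is the (negated) squared error of one point
```

With a cheap model like linear regression the 50 fits are fast; with an expensive model the same loop at \(k=n\) quickly becomes impractical.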

For imbalanced classification, standard k-fold can produce folds with unstable class proportions. Stratified k-fold keeps the class balance of each fold close to the full dataset.

That makes evaluation more reliable when one class is rare.
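A sketch of stratification in action, using scikit-learn's `StratifiedKFold` on a hypothetical imbalanced label vector (90 negatives, 10 positives). Because the class counts divide evenly here, every validation fold keeps exactly the 10% positive rate of the full dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced data: 90 negatives, 10 positives
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_positive_rates = [y[val_idx].mean() for _, val_idx in skf.split(X, y)]
# each validation fold keeps the dataset's 10% positive rate
```

With plain `KFold` on the same data, a shuffled fold of 20 samples could easily contain zero or four positives, which would make metrics like recall swing wildly between folds.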

Time-dependent data requires a different strategy. We must preserve order:

  • train on the past,
  • validate on a later period,
  • then roll or expand the window forward.

This avoids one of the most serious forms of leakage: using future information to predict the past.
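scikit-learn's `TimeSeriesSplit` (an assumed tool; any order-preserving splitter works) implements the expanding-window version of this idea: the training window always ends before the validation window begins.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # observations already in time order

tscv = TimeSeriesSplit(n_splits=3)
splits = list(tscv.split(X))
for train_idx, val_idx in splits:
    # every training index precedes every validation index: no future leakage
    assert train_idx.max() < val_idx.min()
```

Each successive split extends the training window further into the past-plus-present while validating on the next, strictly later, block.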

Repeated k-fold runs the whole k-fold process several times with different random splits. It reduces the variance of the estimate and is often useful for smaller datasets.
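In scikit-learn (assumed here) this is a one-line change: swap the integer `cv` for a `RepeatedKFold` object. With 5 folds and 3 repeats the model is fit 15 times, and the larger pool of scores gives a steadier average.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)

cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)  # 5 x 3 = 15 fits
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
cv_estimate = scores.mean()  # averaged over all 15 fold scores
```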

Other practical reminders:

  • keep a final external test set if you use CV for model selection,
  • perform preprocessing inside each training fold,
  • use parallel computation when models are expensive,
  • and remember that some ensemble methods already provide internal estimates such as out-of-bag error.
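The second reminder, preprocessing inside each training fold, is worth a sketch of its own. Assuming scikit-learn, wrapping the scaler and the model in a `Pipeline` ensures the scaler's statistics are computed from each training fold only, never from the validation fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, random_state=0)

# Inside each CV iteration, StandardScaler is fit on the training fold
# alone, so no validation-fold statistics leak into training.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, X, y, cv=5)
```

Scaling the full dataset once before splitting would let the validation folds influence the training-fold means and variances, a subtle but real form of leakage.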

In this lesson we covered:

  1. Why a single split can be unreliable
  2. How k-fold cross-validation averages performance across folds
  3. When LOOCV is useful and when it is too expensive
  4. Why stratification matters for imbalanced classification
  5. Why time-series validation must respect chronology
  6. How repeated k-fold improves stability

Next: With the evaluation strategy in place, we can decide which features deserve to stay in the model.