Feature Selection and Preprocessing

Feature selection becomes important when:

  • the number of predictors is large,
  • some predictors are redundant or highly correlated,
  • the model is at risk of overfitting,
  • or training time is becoming unnecessarily expensive.

Good selection improves interpretability, reduces noise, and often makes the final model easier to trust.

Most feature-selection methods fall into three families:

  • Filter methods score features before model fitting.
  • Wrapper methods compare subsets by fitting models repeatedly.
  • Embedded methods perform selection during training.

Wrapper methods are usually the most expensive, but they align feature choice directly with predictive performance.

Common tools include:

  • forward / backward / stepwise selection,
  • recursive feature elimination,
  • adjusted \(R^2\) or validation score comparison,
  • ANOVA, AIC, and BIC for nested or competing models.
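As a concrete illustration of a wrapper method, the sketch below runs recursive feature elimination with scikit-learn on synthetic data; the dataset, estimator, and feature counts are illustrative choices, not prescriptions.

```python
# Recursive feature elimination: repeatedly fit the model and drop
# the weakest feature until the target number remains.
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: 20 predictors, only 5 of which carry signal.
X, y = make_regression(n_samples=200, n_features=20,
                       n_informative=5, random_state=0)

rfe = RFE(estimator=LinearRegression(), n_features_to_select=5)
rfe.fit(X, y)

selected = [i for i, keep in enumerate(rfe.support_) if keep]
print("selected feature indices:", selected)
```

Because the model is refit at every elimination step, the cost grows with the number of features, which is why wrappers are the most expensive family.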

Multicollinearity is also important here. A common diagnostic is the variance inflation factor:

\[ \mathrm{VIF}_j = \frac{1}{1 - R_j^2}. \]

Here \(R_j^2\) is the \(R^2\) obtained by regressing predictor \(j\) on all the other predictors. Large VIF values (common rules of thumb flag values above 5 or 10) signal that a feature is almost entirely explained by the others.
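The VIF formula above can be computed directly by regressing each column on the rest; this minimal sketch uses scikit-learn and synthetic data in which two predictors are nearly collinear.

```python
# Minimal VIF computation following VIF_j = 1 / (1 - R_j^2).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + rng.normal(scale=0.1, size=500)   # nearly collinear with x1
x3 = rng.normal(size=500)                   # independent of both
X = np.column_stack([x1, x2, x3])

def vif(X, j):
    # Regress column j on the remaining columns; R^2 of that fit
    # plugs straight into the VIF formula.
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    return 1.0 / (1.0 - r2)

vifs = [vif(X, j) for j in range(X.shape[1])]
print(vifs)  # first two large, third close to 1
```

The collinear pair produces very large VIFs, while the independent column stays near the minimum value of 1.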

Filter methods are fast and model-agnostic. Typical examples include:

  • removing near-zero-variance features,
  • screening for missingness,
  • dropping highly correlated predictors,
  • ranking variables by mutual information or simple univariate scores.

They are useful for quickly cleaning a large feature set before more expensive selection begins.
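Two of the filters listed above, near-zero-variance removal and correlation screening, can be sketched in a few lines of NumPy; the thresholds here are illustrative choices, not universal defaults.

```python
# Filter-style screening: drop near-constant columns, then drop one
# member of each highly correlated pair.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
X[:, 1] = 0.001 * rng.normal(size=300)           # near-zero variance
X[:, 3] = X[:, 0] + 0.01 * rng.normal(size=300)  # near-duplicate of col 0

# Step 1: remove near-constant columns.
keep = [j for j in range(X.shape[1]) if X[:, j].var() > 1e-3]

# Step 2: among survivors, drop the later member of any pair with
# absolute correlation above the threshold.
corr = np.corrcoef(X[:, keep], rowvar=False)
drop = set()
for a in range(len(keep)):
    for b in range(a + 1, len(keep)):
        if abs(corr[a, b]) > 0.95:
            drop.add(keep[b])
keep = [j for j in keep if j not in drop]
print("kept columns:", keep)
```

No model is fit at any point, which is what makes filters cheap enough to run on very wide feature sets.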

Embedded methods learn feature importance while fitting the model itself. In linear models, the best-known examples are:

  • Ridge: shrinks coefficients but rarely removes them entirely,
  • Lasso: can shrink some coefficients exactly to zero,
  • Elastic Net: mixes ridge and lasso behavior.

These methods are especially useful when predictors are numerous or correlated.

Feature selection is stronger when preprocessing is done carefully. Common steps include:

  • standardization when the method depends on scale,
  • log or Box-Cox transforms for skewed variables,
  • dummy encoding for categorical variables,
  • and dimension reduction such as PCA when the original space is too large.

The key rule is still the same: estimate preprocessing parameters (means, scales, transform parameters, encodings) on the training data only, then apply them unchanged to validation and test data.
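This rule can be made concrete with standardization: the scaler's mean and standard deviation are estimated on the training split and then reused, unchanged, on the test split. The data here is synthetic and illustrative.

```python
# Train-only preprocessing: fit the scaler on the training split,
# reuse its statistics on the test split.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(loc=5.0, scale=2.0, size=(300, 4))
y = rng.normal(size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)   # fit on training data only
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)      # apply training statistics

print("train column means ~ 0:", X_train_s.mean(axis=0).round(2))
```

Fitting the scaler on the full dataset instead would leak test-set information into training, biasing any later validation score.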

Even a carefully selected feature set can overfit if it was optimized too aggressively on one dataset. When possible, validate the final feature choice on:

  • a held-out test set,
  • a later time period,
  • or an external dataset collected under similar conditions.

That is the most reliable way to verify that the selected variables carry real signal.
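A minimal version of this check, assuming a held-out split is available: run the selection on the training data only, then score the final model once on the untouched test data. The estimator and sizes are illustrative.

```python
# Validate a selected feature set on a held-out split: selection and
# fitting use only the training data; the test data is scored once.
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=15,
                       n_informative=4, noise=1.0, random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=4)

rfe = RFE(LinearRegression(), n_features_to_select=4).fit(X_tr, y_tr)
model = LinearRegression().fit(X_tr[:, rfe.support_], y_tr)
print("held-out R^2:", model.score(X_te[:, rfe.support_], y_te))
```

If the held-out score collapses relative to the training score, the selection was likely tuned to noise in the training data.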

In this lesson we covered:

  1. Why feature selection matters for generalization and interpretability
  2. The difference between filter, wrapper, and embedded methods
  3. How wrapper methods compare subsets by performance, and how VIF flags multicollinearity
  4. Why fast filter methods are useful for large feature sets
  5. How ridge, lasso, and elastic net act as embedded selectors
  6. Why preprocessing and validation must stay inside the training workflow

Next: We will tune hyperparameters and control complexity with early stopping and pruning.