Model Selection and Regularization

Under the assumption of normality of the errors \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\), we can perform statistical inference about the coefficients.

From the solution of the normal equations, \(\hat{a} = (X^T X)^{-1} X^T Y\), we have:

\[ E(\hat{a}) = a \]

\[ \text{VAR}(\hat{a}) = \sigma^2 (X^T X)^{-1} \]

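Because \(\hat{a}\) is a linear function of the normally distributed \(Y\), it is itself normally distributed: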

\[ \hat{a} \sim \mathcal{N}(a, \sigma^2 (X^T X)^{-1}) \]

The error variance \(\sigma^2\) is unknown in practice and is estimated from the residual sum of squares:

\[ \hat{\sigma}^2 = \frac{\sum_{i=1}^{n} (\hat{\varepsilon}_i)^2}{n - p} = \frac{\|e\|^2}{n - p} \]

where \(p\) is the number of parameters (including intercept).
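The estimator and its standard errors can be computed directly from the formulas above. The following is a minimal sketch, assuming NumPy is available and that the design matrix already contains a column of ones for the intercept; the function name `ols_fit` and the synthetic data are illustrative choices, not part of the lesson.

```python
import numpy as np

def ols_fit(X, y):
    """Least-squares fit; X must already include a column of ones."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)      # (X^T X)^{-1}; assumes full column rank
    a_hat = XtX_inv @ X.T @ y             # solution of the normal equations
    resid = y - X @ a_hat                 # residual vector e
    sigma2_hat = resid @ resid / (n - p)  # ||e||^2 / (n - p)
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))  # standard errors of a_hat
    return a_hat, sigma2_hat, se

# Synthetic data (illustrative only): y = 1 + 2 x + noise with sigma = 0.5
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
X = np.column_stack([np.ones_like(x), x])
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=200)

a_hat, sigma2_hat, se = ols_fit(X, y)
print("coefficients:", a_hat)
print("sigma^2 estimate:", sigma2_hat)
print("standard errors:", se)
```

Explicitly inverting \(X^T X\) mirrors the formulas; for numerical work a solver such as `np.linalg.lstsq` is usually preferable.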


To test whether a coefficient \(a_j\) is significantly different from zero:

Hypotheses:

  • \(H_0\): \(a_j = 0\) (coefficient has no effect)
  • \(H_1\): \(a_j \neq 0\) (coefficient is significant)

Test statistic:

Under \(H_0\), the t-statistic follows a Student's t-distribution with \(n - p\) degrees of freedom:

\[ t_j = \frac{\hat{a}_j}{s_{\hat{a}_j}} \sim t_{n-p} \]

where \(s_{\hat{a}_j}\) is the standard error of \(\hat{a}_j\), the square root of the \(j\)-th diagonal entry of the estimated covariance matrix:

\[ \widehat{\text{VAR}}(\hat{a}) = \hat{\sigma}^2 (X^T X)^{-1} \]

Decision rule:

  • If \(|t_j| > t_{\alpha/2, n-p}\), reject \(H_0\) at significance level \(\alpha\)
  • Typically use \(\alpha = 0.05\) (5%) or \(\alpha = 0.01\) (1%)

Based on the distribution of \(\hat{a}_j\), a \((1-\alpha)\) confidence interval for \(a_j\) is:

\[ a_j \in \left[ \hat{a}_j - s_{\hat{a}_j} \, t_{\alpha/2, n-p} \;,\; \hat{a}_j + s_{\hat{a}_j} \, t_{\alpha/2, n-p} \right] \]

where \(t_{\alpha/2, n-p}\) is the critical value from the Student's t-distribution.

Interpretation: Under repeated sampling, \((1-\alpha) \times 100\%\) of the intervals constructed this way contain the true parameter \(a_j\).
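The t-test and the confidence interval use the same ingredients: the estimate, its standard error, and a critical value. The sketch below assumes NumPy only, so the critical value \(t_{\alpha/2, n-p}\) is passed in by the caller (here a table value); the function name `t_test_and_ci` and the data are hypothetical.

```python
import numpy as np

def t_test_and_ci(X, y, t_crit):
    """t statistics, reject decisions, and confidence intervals per coefficient.

    t_crit is the critical value t_{alpha/2, n-p}, supplied by the caller.
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    a_hat = XtX_inv @ X.T @ y
    resid = y - X @ a_hat
    sigma2_hat = resid @ resid / (n - p)
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
    t_stats = a_hat / se                      # t_j = a_hat_j / s_j
    reject = np.abs(t_stats) > t_crit         # decision rule |t_j| > t_crit
    ci = np.column_stack([a_hat - t_crit * se, a_hat + t_crit * se])
    return t_stats, reject, ci

# Synthetic data: the last predictor has a true coefficient of zero
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = 0.5 + 3.0 * X[:, 1] + rng.normal(size=n)

# t_{0.025, 97} is approximately 1.985 (table value, alpha = 0.05)
t_stats, reject, ci = t_test_and_ci(X, y, t_crit=1.985)
print("t statistics:", t_stats)
print("reject H0:", reject)
print("95% intervals:\n", ci)
```

Rejecting \(H_0\) at level \(\alpha\) is equivalent to the \((1-\alpha)\) interval for \(a_j\) excluding zero.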

For a new observation with predictors \(x_{\text{new}}\), the predicted value is:

\[ \hat{y}_{\text{new}} = x_{\text{new}}^T \hat{a} \]

The variance of the prediction is:

\[ \text{VAR}(\hat{y}_{\text{new}}) = \sigma^2 x_{\text{new}}^T (X^T X)^{-1} x_{\text{new}} \]

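A \((1-\alpha)\) confidence interval for the mean response \(x_{\text{new}}^T a\) is:

\[ x_{\text{new}}^T a \in \left[ \hat{y}_{\text{new}} - s_{\hat{y}} \, t_{\alpha/2, n-p} \;,\; \hat{y}_{\text{new}} + s_{\hat{y}} \, t_{\alpha/2, n-p} \right] \]

where:

\[ s_{\hat{y}} = \hat{\sigma} \sqrt{x_{\text{new}}^T (X^T X)^{-1} x_{\text{new}}} \]

An interval for a single new observation \(y_{\text{new}}\) (a prediction interval) must also account for the error term, so it replaces \(s_{\hat{y}}\) with \(\hat{\sigma} \sqrt{1 + x_{\text{new}}^T (X^T X)^{-1} x_{\text{new}}}\).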

Note: In simple linear regression, confidence bands have a hyperbolic shape, wider at the extremes of the data range.
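The widening of the band away from the center of the data can be checked numerically by evaluating \(s_{\hat{y}}\) at two points. This is a small sketch under the assumption of a simple linear regression on synthetic data; `mean_response_se` is a hypothetical helper name.

```python
import numpy as np

def mean_response_se(X, y, x_new):
    """Return the fitted value at x_new and its standard error s_yhat."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    a_hat = XtX_inv @ X.T @ y
    resid = y - X @ a_hat
    sigma_hat = np.sqrt(resid @ resid / (n - p))
    y_hat_new = x_new @ a_hat
    s_yhat = sigma_hat * np.sqrt(x_new @ XtX_inv @ x_new)
    return y_hat_new, s_yhat

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=50)
X = np.column_stack([np.ones_like(x), x])
y = 2.0 - 0.5 * x + rng.normal(size=50)

# Evaluate at the sample mean of x and at the largest observed x
_, se_center = mean_response_se(X, y, np.array([1.0, x.mean()]))
_, se_edge = mean_response_se(X, y, np.array([1.0, x.max()]))
print("s_yhat at mean of x:", se_center)
print("s_yhat at max of x: ", se_edge)
```

In simple linear regression the standard error is smallest at \(\bar{x}\), which is why the second value is the larger one.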


\(R^2\) never decreases as regressors are added. With SSE the residual sum of squares and SST the total sum of squares \(\sum_{i=1}^{n} (Y_i - \bar{Y})^2\):

\[ R^2 = 1 - \frac{\text{SSE}}{\text{SST}} \]

Adding more predictors (even irrelevant ones) can never decrease \(R^2\). This makes it unsuitable for comparing models with different numbers of variables.

The adjusted R² penalizes model complexity:

\[ \bar{R}^2 = 1 - \frac{\text{SSE}/(n-p)}{\text{SST}/(n-1)} = 1 - (1 - R^2) \frac{n-1}{n-p} \]

Properties:

  • Can decrease if adding a predictor doesn't improve fit enough to justify the extra parameter
  • More appropriate for model comparison
  • Still has limitations (not a likelihood-based criterion)
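The contrast between \(R^2\) and \(\bar{R}^2\) is easy to see by appending predictors that are pure noise. The sketch below assumes NumPy and synthetic data; the function name `r2_and_adjusted` is illustrative.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """R^2 and adjusted R^2; X includes the intercept column, p = X.shape[1]."""
    n, p = X.shape
    a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ a_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p)
    return r2, r2_adj

rng = np.random.default_rng(3)
n = 60
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x])
# Append 5 predictors that are unrelated to y
X_big = np.column_stack([X_small, rng.normal(size=(n, 5))])

r2_small, adj_small = r2_and_adjusted(X_small, y)
r2_big, adj_big = r2_and_adjusted(X_big, y)
print("small model:", r2_small, adj_small)
print("big model:  ", r2_big, adj_big)
```

\(R^2\) of the larger model is at least as high as that of the smaller one, while the adjusted value may go either way depending on how much the noise columns happen to reduce SSE.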

The F-distribution with parameters \(d_1\) and \(d_2\) arises as the ratio of two independent chi-squared variables, each divided by its degrees of freedom:

If \(S_1 \sim \chi^2_{d_1}\) and \(S_2 \sim \chi^2_{d_2}\) are independent, then:

\[ F = \frac{S_1 / d_1}{S_2 / d_2} \sim F(d_1, d_2) \]

Properties:

  • Always non-negative
  • Right-skewed
  • Shape depends on both \(d_1\) and \(d_2\)

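The F-test of overall significance compares a model with predictors to an intercept-only model. The formulas below are written for simple linear regression (one predictor, \(p = 2\)); with \(p\) parameters, the degrees of freedom 1 and \(n - 2\) become \(p - 1\) and \(n - p\).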

Hypotheses:

  • \(H_0\): all coefficients other than the intercept are zero (the model is no better than the mean)
  • \(H_1\): at least one coefficient other than the intercept is nonzero (the model has explanatory power)

Test statistic:

\[ F_{\text{obs}} = \frac{\text{MSR}}{\text{MSE}} = \frac{\text{SSR}/1}{\text{SSE}/(n-2)} \]

where:

  • SSR (Sum of Squares Regression) = \(\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2\)
  • SSE (Sum of Squares Error) = \(\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2\)
  • MSR (Mean Square Regression) = \(\text{SSR} / 1\)
  • MSE (Mean Square Error) = \(\text{SSE} / (n-2)\)

Under \(H_0\):

\[ F_{\text{obs}} \sim F(1, n-2) \]

For simple linear regression, there's a direct relationship:

\[ F_{\text{obs}} = \frac{R^2}{1 - R^2} \cdot (n - 2) \]
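The relationship between \(F_{\text{obs}}\) and \(R^2\) can be verified numerically. The following sketch assumes NumPy and a synthetic simple linear regression; all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 40
x = rng.normal(size=n)
y = 0.3 + 1.5 * x + rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ a_hat

ssr = np.sum((y_hat - y.mean()) ** 2)   # sum of squares regression
sse = np.sum((y - y_hat) ** 2)          # sum of squares error
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares

f_obs = (ssr / 1) / (sse / (n - 2))     # MSR / MSE
r2 = 1 - sse / sst
f_from_r2 = r2 / (1 - r2) * (n - 2)     # same statistic expressed through R^2

print("F from sums of squares:", f_obs)
print("F from R^2:            ", f_from_r2)
```

The two values agree because SST = SSR + SSE when the model contains an intercept.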

Consider two nested models:

  • Model 1 (restricted): \(p_1\) parameters
  • Model 2 (unrestricted): \(p_2\) parameters, where \(p_2 > p_1\)

Model 1 is "nested" within Model 2 if it can be obtained from Model 2 by constraining some of its parameters, typically by setting \(p_2 - p_1\) coefficients to zero.

Hypotheses:

  • \(H_0\): the \(p_2 - p_1\) additional coefficients in Model 2 are all zero (Model 2 fits no better than Model 1)
  • \(H_1\): at least one additional coefficient is nonzero (Model 2 provides a significantly better fit)

Test statistic (the stated distribution holds under \(H_0\)):

\[ F = \frac{(\text{SSE}_1 - \text{SSE}_2) / (p_2 - p_1)}{\text{SSE}_2 / (n - p_2)} \sim F(p_2 - p_1, n - p_2) \]

Interpretation:

  • Large \(F\) → Reject \(H_0\) → The additional predictors improve the model
  • Small \(F\) → Fail to reject \(H_0\) → The additional predictors are not justified
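The nested-model statistic only needs the two residual sums of squares and the parameter counts. Here is a minimal sketch, assuming NumPy and synthetic data; `sse_of` and `nested_f` are hypothetical helper names, and the comparison against a critical value of \(F(p_2 - p_1, n - p_2)\) is left to the reader.

```python
import numpy as np

def sse_of(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ a_hat
    return r @ r

def nested_f(X1, X2, y):
    """F statistic for restricted design X1 nested in unrestricted design X2."""
    n, p1 = X1.shape
    p2 = X2.shape[1]
    sse1, sse2 = sse_of(X1, y), sse_of(X2, y)
    return ((sse1 - sse2) / (p2 - p1)) / (sse2 / (n - p2))

rng = np.random.default_rng(5)
n = 80
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 1.0 * x2 + rng.normal(size=n)   # x3 plays no role
ones = np.ones(n)

X_restricted = np.column_stack([ones, x1])
f_relevant = nested_f(X_restricted, np.column_stack([ones, x1, x2]), y)
f_irrelevant = nested_f(X_restricted, np.column_stack([ones, x1, x3]), y)
print("F when adding x2 (relevant):  ", f_relevant)
print("F when adding x3 (irrelevant):", f_irrelevant)
```

Adding the relevant predictor produces a much larger statistic than adding the irrelevant one, matching the interpretation above.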

When we have many potential predictors, we need systematic methods to select which ones to include.

Backward elimination algorithm:

  1. Start with the full model (all predictors)
  2. Choose a significance level to stay (SLS), e.g., 0.05
  3. Fit the model and find the predictor with the largest p-value
  4. If \(p\text{-value} > \text{SLS}\), remove that predictor
  5. Refit the model with remaining predictors
  6. Repeat steps 3-5 until all predictors have \(p\text{-value} \leq \text{SLS}\)

Pros: Simple; each predictor is evaluated in the presence of all the others

Cons: Can miss important variables if they're only significant in combination with others
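The backward elimination loop can be sketched in a few lines. To stay within NumPy, this version uses a fixed threshold on \(|t|\) as a stand-in for the p-value comparison: within one model, the predictor with the smallest \(|t|\) is the one with the largest p-value, and \(|t| < 2\) roughly corresponds to \(p\text{-value} > 0.05\) when \(n - p\) is large. That substitution, the function names, and the data are assumptions of this sketch.

```python
import numpy as np

def t_statistics(X, y):
    """t statistics of all coefficients; X includes the intercept column."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    a_hat = XtX_inv @ X.T @ y
    resid = y - X @ a_hat
    sigma2_hat = resid @ resid / (n - p)
    return a_hat / np.sqrt(sigma2_hat * np.diag(XtX_inv))

def backward_elimination(X, y, t_crit=2.0):
    """X holds predictors only (no intercept column). Returns kept column indices."""
    n = X.shape[0]
    kept = list(range(X.shape[1]))          # start with the full model
    while kept:
        design = np.column_stack([np.ones(n), X[:, kept]])
        t_abs = np.abs(t_statistics(design, y))[1:]   # skip the intercept
        worst = int(np.argmin(t_abs))       # smallest |t| = largest p-value
        if t_abs[worst] >= t_crit:          # every predictor passes: stop
            break
        kept.pop(worst)                     # remove it, then refit
    return kept

# Synthetic data: only columns 0 and 3 influence y
rng = np.random.default_rng(6)
n = 200
X = rng.normal(size=(n, 6))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=n)

kept = backward_elimination(X, y)
print("kept predictors:", kept)
```

An irrelevant column can still survive by chance, at a rate governed by the chosen threshold.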

Forward selection algorithm:

  1. Start with the intercept-only model (no predictors)
  2. Choose a significance level to enter (SLE), e.g., 0.05
  3. Fit all simple models (one predictor each)
  4. Find the predictor with the smallest p-value
  5. If \(p\text{-value} < \text{SLE}\), add that predictor
  6. For each remaining predictor, fit the model that adds it to the variables selected so far
  7. Repeat steps 4-6 until no additional variables have \(p\text{-value} < \text{SLE}\)

Pros: Computationally efficient for large predictor sets

Cons: Once a variable is added, it stays (can't remove it later)
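Forward selection can be sketched in the same style. As an assumption of this example, the entry rule compares the candidate's \(|t|\) against a fixed threshold instead of comparing its p-value against SLE (within one step, the largest \(|t|\) corresponds to the smallest p-value); function names and data are hypothetical.

```python
import numpy as np

def t_statistics(X, y):
    """t statistics of all coefficients; X includes the intercept column."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    a_hat = XtX_inv @ X.T @ y
    resid = y - X @ a_hat
    sigma2_hat = resid @ resid / (n - p)
    return a_hat / np.sqrt(sigma2_hat * np.diag(XtX_inv))

def forward_selection(X, y, t_crit=2.0):
    """X holds predictors only (no intercept column). Returns selected indices."""
    n, k = X.shape
    selected, remaining = [], list(range(k))   # start from the intercept-only model
    while remaining:
        scores = []
        for j in remaining:
            # Current variables plus candidate j (candidate is the last column)
            design = np.column_stack([np.ones(n), X[:, selected + [j]]])
            scores.append(abs(t_statistics(design, y)[-1]))
        best = int(np.argmax(scores))          # largest |t| = smallest p-value
        if scores[best] <= t_crit:             # no candidate qualifies: stop
            break
        selected.append(remaining.pop(best))
    return selected

# Synthetic data: only columns 0 and 3 influence y
rng = np.random.default_rng(7)
n = 200
X = rng.normal(size=(n, 6))
y = 1.0 + 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=n)

selected = forward_selection(X, y)
print("selected predictors (in order of entry):", selected)
```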

Stepwise regression combines forward selection and backward elimination. Algorithm:

  1. Start like forward selection
  2. After each new variable is added, perform backward elimination on all variables
  3. Continue until no variables can be added or removed

Pros: More flexible than forward or backward alone

Cons: Still greedy, no guarantee of finding the optimal model


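Instead of sequences of significance tests, candidate models can be ranked with information criteria. The Akaike Information Criterion (AIC) is:

\[ \text{AIC} = 2p - 2\ln(L) \]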

where:

  • \(p\) is the number of parameters
  • \(L\) is the maximized likelihood

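For linear regression with normal errors, up to an additive constant that is the same for every model fit to the same data: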

\[ \text{AIC} = n \ln\left(\frac{\text{SSE}}{n}\right) + 2p \]

Lower AIC is better: the criterion trades off goodness of fit against complexity.

The Bayesian Information Criterion (BIC) is:

\[ \text{BIC} = p \ln(n) - 2\ln(L) \]

For linear regression with normal errors, up to the same additive constant:

\[ \text{BIC} = n \ln\left(\frac{\text{SSE}}{n}\right) + p \ln(n) \]

Lower BIC is better

BIC vs AIC:

  • BIC penalizes each parameter by \(\ln(n)\) instead of 2, so its penalty is heavier than AIC's whenever \(n \geq 8\) and grows with \(n\)
  • BIC tends to select simpler models than AIC
  • Both are useful; often compare multiple criteria
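Both criteria can be computed from SSE, \(n\), and \(p\) using the linear-regression forms above. The sketch below assumes NumPy and synthetic data; `aic_bic` is an illustrative name, and the values are only comparable across models fit to the same data.

```python
import numpy as np

def aic_bic(X, y):
    """AIC and BIC (up to a shared constant); X includes the intercept column."""
    n, p = X.shape
    a_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    sse = np.sum((y - X @ a_hat) ** 2)
    fit_term = n * np.log(sse / n)
    return fit_term + 2 * p, fit_term + p * np.log(n)

rng = np.random.default_rng(8)
n = 100
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(size=n)

X_null = np.ones((n, 1))                                     # intercept only
X_true = np.column_stack([np.ones(n), x])                    # correct model
X_over = np.column_stack([X_true, rng.normal(size=(n, 3))])  # 3 noise predictors

for name, design in [("null", X_null), ("true", X_true), ("over", X_over)]:
    aic, bic = aic_bic(design, y)
    print(f"{name}: AIC = {aic:.2f}, BIC = {bic:.2f}")

aic_null, bic_null = aic_bic(X_null, y)
aic_true, bic_true = aic_bic(X_true, y)
aic_over, bic_over = aic_bic(X_over, y)
```

The gap between BIC and AIC for a given model is \(p(\ln(n) - 2)\), so the overparameterized model is penalized more strongly by BIC.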

Collinearity occurs when predictors are highly correlated with each other.

Perfect collinearity: One predictor is an exact linear combination of others

  • \(X^T X\) is singular (not invertible)
  • Parameters are non-identifiable

High collinearity (but not perfect):

  • \(X^T X\) is technically invertible but numerically unstable
  • Coefficient estimates have large standard errors
  • Small changes in data cause large changes in coefficients

Variance Inflation Factor (VIF):

For predictor \(j\):

\[ \text{VIF}_j = \frac{1}{1 - R_j^2} \]

where \(R_j^2\) is the \(R^2\) from regressing \(X_j\) on all other predictors.

Rule of thumb:

  • \(\text{VIF} > 5\): Moderate collinearity
  • \(\text{VIF} > 10\): Severe collinearity

Solutions for collinearity:

  1. Remove one of the correlated predictors
  2. Combine predictors (e.g., create an average or principal component)
  3. Regularization methods: Ridge, Lasso, Elastic Net
  4. Collect more data (if possible)
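The VIF defined above can be computed by running the auxiliary regression of each predictor on the others. This is a minimal sketch, assuming NumPy and a predictor matrix without an intercept column (the intercept is added inside the auxiliary regressions); the function name `vif` and the data are illustrative.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (predictors only)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        # Regress X_j on an intercept and all other predictors
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ coef
        sst_j = np.sum((X[:, j] - X[:, j].mean()) ** 2)
        r2_j = 1.0 - resid @ resid / sst_j
        out[j] = 1.0 / (1.0 - r2_j)
    return out

rng = np.random.default_rng(9)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)              # unrelated to the others
vifs = vif(np.column_stack([x1, x2, x3]))
print("VIF:", vifs)
```

The two nearly identical predictors receive VIF values far above the thresholds in the rule of thumb, while the independent one stays close to 1.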

When \(p\) is large or predictors are highly correlated, regularization can help.

Ridge regression objective:

\[ \min_a \left\{ \|Y - Xa\|^2 + \lambda \|a\|^2 \right\} \]

  • Adds an \(\ell_2\) penalty on coefficient magnitudes
  • Shrinks coefficients toward zero (but in general never exactly to zero)
  • Solution: \(\hat{a}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T Y\)
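The closed-form ridge solution translates directly into code. The sketch below assumes NumPy, centered data, and no intercept column (in practice the intercept is usually left unpenalized); the function name `ridge` and the data are illustrative.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Two highly correlated predictors, no intercept (data are mean-zero by design)
rng = np.random.default_rng(10)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(size=n)

a_ols = ridge(X, y, lam=0.0)     # lam = 0 reproduces ordinary least squares
a_ridge = ridge(X, y, lam=10.0)
print("OLS:  ", a_ols)
print("ridge:", a_ridge)
```

The norm of the ridge estimate is smaller than that of the least-squares estimate, which is the shrinkage described above.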

Lasso objective:

\[ \min_a \left\{ \|Y - Xa\|^2 + \lambda \|a\|_1 \right\} \]

  • Adds an \(\ell_1\) penalty on coefficient magnitudes
  • Can set some coefficients exactly to zero (variable selection)
  • No closed-form solution (requires optimization algorithms)
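One common optimization algorithm for this objective is cyclic coordinate descent with soft-thresholding; the lesson does not prescribe a particular solver, so this choice is an assumption of the sketch. For the objective as written above, updating a single coordinate has the closed form \(a_j = S(\rho_j, \lambda/2) / \|x_j\|^2\), where \(\rho_j\) is the inner product of column \(j\) with the residual that ignores column \(j\). The function names and data are hypothetical, and no intercept is fitted.

```python
import numpy as np

def soft_threshold(z, t):
    """S(z, t) = sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimize ||y - X a||^2 + lam * ||a||_1 by cyclic coordinate descent."""
    n, p = X.shape
    a = np.zeros(p)
    col_sq = np.sum(X ** 2, axis=0)                    # ||x_j||^2 for each column
    for _ in range(n_iter):
        for j in range(p):
            partial_resid = y - X @ a + X[:, j] * a[j]  # residual ignoring column j
            rho = X[:, j] @ partial_resid
            a[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
    return a

# Synthetic data: only columns 0 and 3 have nonzero true coefficients
rng = np.random.default_rng(11)
n = 100
X = rng.normal(size=(n, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=n)

a_lasso = lasso_cd(X, y, lam=100.0)
print("lasso coefficients:", a_lasso)
```

With this penalty level the irrelevant coefficients are set exactly to zero, and the relevant ones are shrunk relative to their true values.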

Choosing \(\lambda\): Use cross-validation to select the regularization parameter that minimizes prediction error on held-out data.
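Selecting \(\lambda\) by cross-validation can be sketched with a plain K-fold loop. This example uses the closed-form ridge estimate for illustration; the grid of \(\lambda\) values, the number of folds, the function names, and the data are all assumptions of the sketch.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_error(X, y, lam, k=5, seed=0):
    """Mean squared prediction error of ridge over k held-out folds."""
    n = X.shape[0]
    idx = np.random.default_rng(seed).permutation(n)
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)              # all indices not in this fold
        a = ridge(X[train], y[train], lam)           # fit on the training part
        errs.append(np.mean((y[fold] - X[fold] @ a) ** 2))  # score on held-out part
    return float(np.mean(errs))

# Synthetic data with correlated predictors (no intercept, mean-zero by design)
rng = np.random.default_rng(12)
n, p = 120, 8
base = rng.normal(size=(n, 1))
X = base + 0.5 * rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + rng.normal(size=n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
errors = [cv_error(X, y, lam) for lam in lambdas]
best_lam = lambdas[int(np.argmin(errors))]
print("CV errors:", errors)
print("selected lambda:", best_lam)
```

Using the same fold assignment for every \(\lambda\) (fixed `seed`) keeps the comparison between candidate values fair.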


In this lesson we covered:

  1. Inference in regression: hypothesis tests and confidence intervals for coefficients
  2. Adjusted R² to account for model complexity
  3. F-distribution and F-tests for overall significance and nested model comparison
  4. Model selection methods: backward elimination, forward selection, stepwise regression
  5. Information criteria: AIC and BIC
  6. Collinearity: detection (VIF) and solutions
  7. Regularization: Ridge and Lasso for high-dimensional problems

Next: We'll extend beyond normal linear regression to Generalized Linear Models (GLM) and logistic regression for classification problems.