Generalized Linear Models (GLM) and Logistic Regression

The generalized linear model (GLM) is a generalization of ordinary linear regression that allows the response variable to have distributions other than the normal distribution.

A GLM consists of three elements:

The first element is the random component: the response variable \(Y\) follows a distribution from the exponential family, for example:

  • Normal distribution (ordinary linear regression)
  • Bernoulli distribution (logistic regression)
  • Poisson distribution (count data)
  • Exponential distribution (survival/time-to-event data)
  • Gamma distribution (positive continuous data with skewness)
The second element is the linear predictor \(\eta\):

\[ \eta = Xa = a_0 + a_1 X_1 + a_2 X_2 + \ldots + a_p X_p \]

This is the same linear combination of predictors as in ordinary regression.

The third element is a link function \(g\), which connects the expected value \(\mu = E(Y)\) to the linear predictor:

\[ g(\mu) = \eta \]

or equivalently:

\[ E(Y) = \mu = g^{-1}(\eta) \]

The generalized linear model is defined by:

\[ E(Y) = g^{-1}(Xa) \]

For ordinary linear regression:

  • Distribution: Normal
  • Link function: \(g(x) = x\) (identity link)
  • Result: \(E(Y) = Xa\)

Different response distributions have corresponding canonical link functions:

  • Normal: support \(\mathbb{R}\); link \(g(\mu) = \mu\) (identity); inverse link \(g^{-1}(\eta) = \eta\); use case: continuous responses
  • Bernoulli: support \(\{0, 1\}\); link \(g(\mu) = \log\left(\frac{\mu}{1-\mu}\right)\) (logit); inverse link \(g^{-1}(\eta) = \frac{e^\eta}{1+e^\eta}\); use case: binary classification
  • Poisson: support \(\mathbb{N}\); link \(g(\mu) = \log(\mu)\); inverse link \(g^{-1}(\eta) = e^\eta\); use case: count data
  • Gamma: support \(\mathbb{R}^+\); link \(g(\mu) = \frac{1}{\mu}\) (inverse); inverse link \(g^{-1}(\eta) = \frac{1}{\eta}\); use case: positive continuous data (e.g., insurance claims)
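These inverse links are simple enough to check directly in code. A small NumPy sketch (the dictionary names and test value \(\eta = 2\) are illustrative, not part of any library API):

```python
import numpy as np

# canonical inverse links g^{-1}(eta), mapping the linear predictor
# back into each distribution's mean space
inverse_links = {
    "normal":    lambda eta: eta,                      # identity: any real mean
    "bernoulli": lambda eta: 1 / (1 + np.exp(-eta)),   # logistic: mean in (0, 1)
    "poisson":   lambda eta: np.exp(eta),              # exp: positive mean
    "gamma":     lambda eta: 1 / eta,                  # reciprocal: positive mean for eta > 0
}

for name, g_inv in inverse_links.items():
    print(name, g_inv(2.0))
```

Note that each inverse link lands in the support-appropriate mean space: the Bernoulli value stays in \((0, 1)\), and the Poisson and Gamma values stay positive.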

In GLMs, the unknown parameters \(a\) are typically estimated using maximum likelihood estimation (MLE):

\[ \hat{a} = \arg\max_a \, L(a \mid Y, X) = \arg\max_a \sum_{i=1}^{n} \log p(y_i \mid x_i, a) \]

Unlike ordinary regression (which has a closed-form solution), GLMs generally require iterative optimization algorithms:

  • Newton-Raphson method
  • Iteratively Reweighted Least Squares (IRLS)

Logistic regression is used when the response variable \(Y\) is binary (taking values 0 or 1).

When \(Y_i \in \{0, 1\}\), we model it as a Bernoulli random variable:

\[ Y_i \sim \text{Bernoulli}(p_i) \]

where \(p_i = P(Y_i = 1)\) is the probability of "success" for observation \(i\).

The expected value is:

\[ E(Y_i) = \mu_i = p_i \]

The logit (log-odds) link function is the canonical link for logistic regression:

\[ g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right) = \log\left(\frac{p}{1 - p}\right) \]

This maps probabilities from \((0, 1)\) to the entire real line \((-\infty, +\infty)\).

Inverting the logit gives the logistic function (sigmoid):

\[ \mu = g^{-1}(\eta) = \frac{\exp(\eta)}{1 + \exp(\eta)} = \frac{1}{1 + \exp(-\eta)} \]

This maps the linear predictor \(\eta\) back to a probability in \((0, 1)\).
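The inverse relationship between the logit and the sigmoid can be verified numerically; a minimal NumPy sketch:

```python
import numpy as np

def logit(p):
    # log-odds: maps probabilities in (0, 1) to the real line
    return np.log(p / (1 - p))

def sigmoid(eta):
    # logistic function: inverse of the logit, maps the real line to (0, 1)
    return 1.0 / (1.0 + np.exp(-eta))

p = np.array([0.1, 0.5, 0.9])
print(sigmoid(logit(p)))  # recovers [0.1, 0.5, 0.9]
```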


For a binary response, the logistic regression model is:

\[ \log\left(\frac{p_i}{1 - p_i}\right) = a_0 + a_1 X_{i1} + a_2 X_{i2} + \ldots + a_p X_{ip} \]

Equivalently:

\[ p_i = P(Y_i = 1 \mid X_i) = \frac{\exp(a_0 + a_1 X_{i1} + \ldots + a_p X_{ip})}{1 + \exp(a_0 + a_1 X_{i1} + \ldots + a_p X_{ip})} \]

Interpretation of coefficients:

  • \(a_j > 0\): Increasing \(X_j\) increases the log-odds (and thus the probability) of \(Y = 1\)
  • \(a_j < 0\): Increasing \(X_j\) decreases the log-odds of \(Y = 1\)
  • \(\exp(a_j)\): The odds ratio associated with a one-unit increase in \(X_j\)

For binary data, the likelihood is:

\[ L(a) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} \]

Taking the logarithm:

\[ \ell(a) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \]

Substituting \(p_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)}\) where \(\eta_i = x_i^T a\):

\[ \ell(a) = \sum_{i=1}^{n} \left[ y_i \eta_i - \log(1 + \exp(\eta_i)) \right] \]

In practice, we minimize the negative log-likelihood:

\[ \text{NLL}(a) = -\ell(a) = -\sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \]

This is also known as the binary cross-entropy loss.
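A minimal NumPy sketch of this loss, using the \(\sum_i \left[ \log(1 + e^{\eta_i}) - y_i \eta_i \right]\) form derived above, which is numerically safer than evaluating \(\log(p_i)\) directly:

```python
import numpy as np

def binary_nll(a, X, y):
    """Negative log-likelihood (binary cross-entropy) for logistic regression.

    Uses sum_i [log(1 + exp(eta_i)) - y_i * eta_i], computed with
    logaddexp(0, eta) = log(1 + exp(eta)) for stability at large |eta|.
    """
    eta = X @ a
    return np.sum(np.logaddexp(0.0, eta) - y * eta)
```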

Optimization: There is no closed-form solution. We use iterative algorithms:

  • Gradient descent
  • Newton-Raphson
  • IRLS (Iteratively Reweighted Least Squares)
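As a sketch of how IRLS works for the logistic model, here is a pure-NumPy implementation fit to synthetic data (the function name and the coefficient values are illustrative):

```python
import numpy as np

def fit_logistic_irls(X, y, n_iter=25, tol=1e-8):
    """Fit logistic regression by IRLS (equivalent to Newton-Raphson here)."""
    a = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ a
        mu = 1.0 / (1.0 + np.exp(-eta))        # current fitted probabilities
        mu = np.clip(mu, 1e-9, 1 - 1e-9)       # guard against zero weights
        w = mu * (1.0 - mu)                    # Bernoulli variance weights
        z = eta + (y - mu) / w                 # working response
        # weighted least squares step: solve (X^T W X) a = X^T W z
        a_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(a_new - a)) < tol:
            return a_new
        a = a_new
    return a

# synthetic data with known coefficients a = (-0.5, 2.0)
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(2000), rng.normal(size=2000)])
y = (rng.random(2000) < 1 / (1 + np.exp(-(X @ np.array([-0.5, 2.0]))))).astype(float)
a_hat = fit_logistic_irls(X, y)
print(a_hat)  # should be near [-0.5, 2.0]
```

Each iteration is a weighted least-squares solve, which is why the method is called iteratively reweighted least squares.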

The odds of \(Y = 1\) are:

\[ \text{Odds} = \frac{p}{1 - p} = \exp(\eta) = \exp(a_0 + a_1 X_1 + \ldots + a_p X_p) \]

For a one-unit increase in \(X_j\):

\[ \text{Odds Ratio} = \exp(a_j) \]

Example: If \(\exp(a_1) = 2.5\), then a one-unit increase in \(X_1\) multiplies the odds of \(Y = 1\) by 2.5.
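This multiplicative effect can be checked numerically; a short sketch with hypothetical coefficients, comparing the odds at \(X_1 = 2\) and \(X_1 = 3\):

```python
import numpy as np

def odds(p):
    return p / (1 - p)

sigmoid = lambda eta: 1.0 / (1.0 + np.exp(-eta))

a0, a1 = -1.0, np.log(2.5)          # hypothetical coefficients, exp(a1) = 2.5
p_before = sigmoid(a0 + a1 * 2.0)   # probability at X1 = 2
p_after = sigmoid(a0 + a1 * 3.0)    # probability at X1 = 3
ratio = odds(p_after) / odds(p_before)
print(ratio)  # 2.5 (up to floating point)
```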

To classify a new observation, we typically use a threshold (e.g., \(p = 0.5\)):

\[ \hat{Y} = \begin{cases} 1 & \text{if } p > 0.5 \\ 0 & \text{otherwise} \end{cases} \]

The decision boundary is where \(p = 0.5\), which occurs when:

\[ \eta = a_0 + a_1 X_1 + \ldots + a_p X_p = 0 \]

For two predictors, this defines a line in the feature space.
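For two predictors with \(a_2 \neq 0\), the boundary line is \(X_2 = -(a_0 + a_1 X_1)/a_2\); a quick check with hypothetical coefficients that every point on it has \(p = 0.5\):

```python
import numpy as np

a0, a1, a2 = -1.0, 2.0, 0.5        # hypothetical fitted coefficients
x1 = np.linspace(-3, 3, 7)
x2 = -(a0 + a1 * x1) / a2          # points on the line eta = 0
p = 1 / (1 + np.exp(-(a0 + a1 * x1 + a2 * x2)))
print(p)  # all 0.5: the decision boundary
```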


The deviance is a measure of model fit. For binary data, where the saturated model attains log-likelihood zero, it reduces to:

\[ D = -2\,\ell(a) \]

Lower deviance indicates better fit.

For classification tasks, we use:

            Predicted 0        Predicted 1
Actual 0    TN (True Neg)      FP (False Pos)
Actual 1    FN (False Neg)     TP (True Pos)

Metrics:

  • Accuracy: \(\frac{TP + TN}{TP + TN + FP + FN}\)
  • Precision: \(\frac{TP}{TP + FP}\)
  • Recall (Sensitivity): \(\frac{TP}{TP + FN}\)
  • Specificity: \(\frac{TN}{TN + FP}\)
  • F1-Score: Harmonic mean of precision and recall
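These metrics are straightforward to compute from the confusion-matrix counts; a minimal NumPy sketch for 0/1 labels:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Confusion-matrix-based metrics for binary 0/1 labels."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "f1": 2 * precision * recall / (precision + recall),
    }

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 0, 1, 1])
print(classification_metrics(y_true, y_pred))
```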

The ROC curve (Receiver Operating Characteristic) plots True Positive Rate vs False Positive Rate at various threshold settings.

AUC (Area Under the Curve):

  • \(AUC = 1\): Perfect classifier
  • \(AUC = 0.5\): Random classifier
  • \(AUC > 0.8\): Generally considered good
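AUC can also be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal sketch of this pairwise view (ties are ignored for simplicity):

```python
import numpy as np

def auc_score(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly
    (the Mann-Whitney view; ties are ignored for simplicity)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    return np.mean(pos[:, None] > neg[None, :])

scores = np.array([0.9, 0.8, 0.3, 0.2])
labels = np.array([1, 1, 0, 0])
print(auc_score(scores, labels))  # 1.0: every positive outscores every negative
```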

Like linear regression, logistic regression can benefit from regularization when dealing with many predictors or collinearity.

Ridge (L2 penalty):

\[ \text{NLL}(a) + \lambda \sum_{j=1}^{p} a_j^2 \]

Lasso (L1 penalty):

\[ \text{NLL}(a) + \lambda \sum_{j=1}^{p} |a_j| \]

Benefits:

  • Prevents overfitting
  • Can perform variable selection (Lasso)
  • Improves generalization to new data
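As a sketch of the ridge-penalized case, here is a gradient-descent fit of the L2-penalized negative log-likelihood (pure NumPy; the function name, learning rate, and penalty scaling are illustrative choices, not a standard API):

```python
import numpy as np

def fit_logistic_ridge(X, y, lam=1.0, lr=0.5, n_iter=3000):
    """Ridge-penalized logistic regression by gradient descent.

    Minimizes mean NLL + lam * ||a||^2 (penalty applied to all coefficients
    here for simplicity; in practice the intercept is usually left unpenalized).
    """
    n, p = X.shape
    a = np.zeros(p)
    for _ in range(n_iter):
        probs = 1.0 / (1.0 + np.exp(-(X @ a)))
        grad = X.T @ (probs - y) / n + 2.0 * lam * a  # NLL gradient + L2 term
        a -= lr * grad
    return a
```

With \(\lambda > 0\) the fitted coefficients are pulled toward zero relative to the unpenalized fit, which is the shrinkage effect described above.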

For \(K > 2\) classes, we use softmax regression (multinomial logistic regression).

The probability that observation \(i\) belongs to class \(k\) is:

\[ p_{ik} = P(Y_i = k \mid X_i) = \frac{\exp(\eta_{ik})}{\sum_{j=1}^{K} \exp(\eta_{ij})} \]

where \(\eta_{ik} = a_k^T X_i\) for each class \(k\).

Training: Minimize the categorical cross-entropy loss:

\[ \text{Loss} = -\sum_{i=1}^{n} \sum_{k=1}^{K} \mathbb{1}(y_i = k) \log(p_{ik}) \]
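A minimal NumPy sketch of the softmax and this loss (the max-subtraction trick is the standard guard against overflow; variable names are illustrative):

```python
import numpy as np

def softmax(Eta):
    # subtract the row max before exponentiating for numerical stability;
    # this leaves the probabilities unchanged
    e = np.exp(Eta - Eta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def categorical_cross_entropy(A, X, y, eps=1e-12):
    """Softmax-regression loss; y holds integer class labels 0..K-1."""
    P = softmax(X @ A)                            # n x K class probabilities
    return -np.sum(np.log(P[np.arange(len(y)), y] + eps))

X = np.array([[1.0, 0.0], [0.0, 1.0]])            # 2 observations, 2 features
A = np.array([[2.0, 0.0, -1.0],                   # one coefficient column per
              [0.0, 2.0, -1.0]])                  # class (K = 3)
print(softmax(X @ A).sum(axis=1))  # each row sums to 1
```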

In this lesson we covered:

  1. Generalized Linear Models (GLM): extending beyond normal distributions
  2. Three components of a GLM: distribution, linear predictor, link function
  3. Logistic regression: binary classification using the logit link
  4. Maximum likelihood estimation: minimizing the negative log-likelihood
  5. Interpretation: odds, odds ratios, decision boundaries
  6. Model evaluation: confusion matrix, accuracy, precision, recall, ROC/AUC
  7. Regularization: Ridge and Lasso for logistic regression
  8. Multiclass extension: softmax regression

This completes the linear models series! You now have a solid foundation in:

  • Simple and multiple linear regression
  • Inference and diagnostics
  • Feature engineering and polynomial regression
  • Model selection and regularization
  • GLMs and logistic regression for classification

Natural next topics that build on this material include:

  • Generalized Additive Models (GAMs): Non-parametric extensions of GLMs
  • Survival Analysis: Cox proportional hazards model
  • Count Data: Poisson and negative binomial regression
  • Hierarchical/Mixed Effects Models: Accounting for grouped/nested data structure