Mathematical Annexes

These annexes provide detailed mathematical derivations for the key results in linear models.


For simple linear regression:

\[ Y_i = a X_i + b + \varepsilon_i \]

We want to minimize the sum of squared residuals:

\[ \text{SSE}(a, b) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - (aX_i + b))^2 \]

Step 1: Expand the objective function

\[ \text{SSE}(a, b) = \sum_{i=1}^{n} (Y_i - aX_i - b)^2 \]

Step 2: Take partial derivatives

Partial derivative with respect to \(a\):

\[ \frac{\partial \text{SSE}}{\partial a} = \sum_{i=1}^{n} 2(Y_i - aX_i - b)(-X_i) = -2 \sum_{i=1}^{n} X_i(Y_i - aX_i - b) \]
\[ = -2 \left[ \sum_{i=1}^{n} X_i Y_i - a \sum_{i=1}^{n} X_i^2 - b \sum_{i=1}^{n} X_i \right] \]

Partial derivative with respect to \(b\):

\[ \frac{\partial \text{SSE}}{\partial b} = \sum_{i=1}^{n} 2(Y_i - aX_i - b)(-1) = -2 \sum_{i=1}^{n} (Y_i - aX_i - b) \]
\[ = -2 \left[ \sum_{i=1}^{n} Y_i - a \sum_{i=1}^{n} X_i - nb \right] \]

Step 3: Set partial derivatives to zero

Setting \(\frac{\partial \text{SSE}}{\partial b} = 0\):

\[ \sum_{i=1}^{n} Y_i - a \sum_{i=1}^{n} X_i - nb = 0 \]
\[ nb = \sum_{i=1}^{n} Y_i - a \sum_{i=1}^{n} X_i \]
\[ b = \frac{1}{n} \sum_{i=1}^{n} Y_i - a \cdot \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{Y} - a\bar{X} \]

This gives us:

\[ \boxed{\hat{b} = \bar{Y} - \hat{a}\bar{X}} \]

Setting \(\frac{\partial \text{SSE}}{\partial a} = 0\) and substituting \(b = \bar{Y} - a\bar{X}\):

\[ \sum_{i=1}^{n} X_i Y_i - a \sum_{i=1}^{n} X_i^2 - (\bar{Y} - a\bar{X}) \sum_{i=1}^{n} X_i = 0 \]
\[ \sum_{i=1}^{n} X_i Y_i - a \sum_{i=1}^{n} X_i^2 - \bar{Y} \sum_{i=1}^{n} X_i + a\bar{X} \sum_{i=1}^{n} X_i = 0 \]
\[ \sum_{i=1}^{n} X_i Y_i - \bar{Y} \sum_{i=1}^{n} X_i = a \left( \sum_{i=1}^{n} X_i^2 - \bar{X} \sum_{i=1}^{n} X_i \right) \]

Step 4: Simplify using centered forms

Note that \(\sum_{i=1}^{n} X_i = n\bar{X}\), so:

\[ \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = a \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right) \]

Recognizing covariance and variance:

\[ \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} \]
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \]

Therefore:

\[ \boxed{\hat{a} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}} \]
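As a quick numerical check of these closed-form estimators, the sketch below (using NumPy, on a small made-up dataset) computes \(\hat{a}\) and \(\hat{b}\) directly from the centered sums and compares them against NumPy's own least-squares fit:

```python
import numpy as np

# Small synthetic dataset (hypothetical values, chosen for illustration).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

Xbar, Ybar = X.mean(), Y.mean()

# Slope: centered cross-products over centered sum of squares.
a_hat = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
# Intercept: b_hat = Ybar - a_hat * Xbar.
b_hat = Ybar - a_hat * Xbar

# Cross-check against NumPy's degree-1 least-squares polynomial fit.
a_np, b_np = np.polyfit(X, Y, deg=1)
assert np.isclose(a_hat, a_np) and np.isclose(b_hat, b_np)
```

For this dataset the centered formulas give \(\hat{a} = 19.3 / 10 = 1.93\) and \(\hat{b} = 6.02 - 1.93 \times 3 = 0.23\), matching `np.polyfit`.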

For multiple regression in matrix form:

\[ Y = Xa + E \]

We want to minimize:

\[ \text{SSE}(a) = \|Y - Xa\|^2 = (Y - Xa)^T(Y - Xa) \]

Step 1: Expand the objective function

\[ \text{SSE}(a) = (Y - Xa)^T(Y - Xa) \]
\[ = Y^T Y - Y^T Xa - a^T X^T Y + a^T X^T X a \]

Since \(Y^T Xa\) is a scalar, it equals its transpose \(a^T X^T Y\), so:

\[ = Y^T Y - 2a^T X^T Y + a^T X^T X a \]

Step 2: Take the derivative with respect to \(a\)

Using matrix calculus rules:

  • \(\frac{\partial}{\partial a}(a^T c) = c\) for constant vector \(c\)
  • \(\frac{\partial}{\partial a}(a^T A a) = 2Aa\) for symmetric matrix \(A\)

\[ \frac{\partial \text{SSE}}{\partial a} = -2X^T Y + 2X^T X a \]

Step 3: Set derivative to zero

\[ -2X^T Y + 2X^T X a = 0 \]
\[ X^T X a = X^T Y \]

Step 4: Solve for \(a\)

Assuming \(X^T X\) is invertible (i.e., \(X\) has full column rank):

\[ \boxed{\hat{a} = (X^T X)^{-1} X^T Y} \]
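A minimal NumPy sketch of the normal equation, on simulated data with an assumed true coefficient vector: in practice one solves the linear system \(X^T X a = X^T Y\) rather than forming the inverse explicitly, and the result can be cross-checked against NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 3
# Design matrix with an intercept column; remaining columns are random.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
a_true = np.array([1.0, 2.0, -0.5])   # assumed true coefficients
Y = X @ a_true + 0.1 * rng.normal(size=n)

# Normal equation: solve X^T X a = X^T Y (avoids explicit matrix inversion).
a_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares solver.
a_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(a_hat, a_lstsq)
```

With low noise, `a_hat` lands close to `a_true`, as the unbiasedness result below suggests it should.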

Unbiasedness:

\[ E(\hat{a}) = E\left[ (X^T X)^{-1} X^T Y \right] = E\left[ (X^T X)^{-1} X^T (Xa + E) \right] \]
\[ = (X^T X)^{-1} X^T X a + (X^T X)^{-1} X^T E(E) = a + 0 = a \]

Variance:

\[ \text{Var}(\hat{a}) = \text{Var}\left[ (X^T X)^{-1} X^T Y \right] \]

Since \(Y = Xa + E\):

\[ \hat{a} = (X^T X)^{-1} X^T (Xa + E) = a + (X^T X)^{-1} X^T E \]
\[ \hat{a} - a = (X^T X)^{-1} X^T E \]
\[ \text{Var}(\hat{a}) = E\left[ (\hat{a} - a)(\hat{a} - a)^T \right] \]
\[ = E\left[ (X^T X)^{-1} X^T E E^T X (X^T X)^{-1} \right] \]

Under the assumption \(E(E E^T) = \sigma^2 I\):

\[ = (X^T X)^{-1} X^T (\sigma^2 I) X (X^T X)^{-1} \]
\[ = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} \]
\[ \boxed{\text{Var}(\hat{a}) = \sigma^2 (X^T X)^{-1}} \]

For logistic regression with binary response \(Y_i \in \{0, 1\}\):

\[ P(Y_i = 1 \mid X_i) = p_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} = \frac{1}{1 + \exp(-\eta_i)} \]

where \(\eta_i = x_i^T a\).

We want to find the maximum likelihood estimator \(\hat{a}\).

For a single observation:

\[ P(Y_i = y_i \mid X_i) = p_i^{y_i} (1 - p_i)^{1 - y_i} \]

For \(n\) independent observations:

\[ L(a) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} \]

Taking logarithms:

\[ \ell(a) = \log L(a) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \]

Step 1: Express in terms of \(\eta_i\)

\[ p_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} = \frac{1}{1 + \exp(-\eta_i)} \]
\[ 1 - p_i = \frac{1}{1 + \exp(\eta_i)} \]
\[ \log(p_i) = \log\left(\frac{\exp(\eta_i)}{1 + \exp(\eta_i)}\right) = \eta_i - \log(1 + \exp(\eta_i)) \]
\[ \log(1 - p_i) = \log\left(\frac{1}{1 + \exp(\eta_i)}\right) = -\log(1 + \exp(\eta_i)) \]

Step 2: Substitute into log-likelihood

\[ \ell(a) = \sum_{i=1}^{n} \left[ y_i (\eta_i - \log(1 + \exp(\eta_i))) + (1 - y_i)(-\log(1 + \exp(\eta_i))) \right] \]
\[ = \sum_{i=1}^{n} \left[ y_i \eta_i - y_i \log(1 + \exp(\eta_i)) - \log(1 + \exp(\eta_i)) + y_i \log(1 + \exp(\eta_i)) \right] \]
\[ = \sum_{i=1}^{n} \left[ y_i \eta_i - \log(1 + \exp(\eta_i)) \right] \]
\[ \boxed{\ell(a) = \sum_{i=1}^{n} \left[ y_i x_i^T a - \log(1 + \exp(x_i^T a)) \right]} \]

To obtain a loss to minimize (equivalent to maximizing \(\ell(a)\)), we negate the log-likelihood:

\[ \boxed{\text{NLL}(a) = -\ell(a) = \sum_{i=1}^{n} \left[ \log(1 + \exp(x_i^T a)) - y_i x_i^T a \right]} \]

Alternatively, in the original probability form:

\[ \boxed{\text{NLL}(a) = -\sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]} \]

This is the binary cross-entropy loss.
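The two boxed forms of the NLL are algebraically identical, which the NumPy sketch below verifies on hypothetical logits and labels. Note that `np.logaddexp(0, eta)` computes \(\log(1 + \exp(\eta))\) without overflow for large \(\eta\), which is why the logit form is usually preferred in practice:

```python
import numpy as np

# Hypothetical logits eta_i = x_i^T a and binary labels, for illustration.
eta = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
y = np.array([0, 0, 1, 1, 1], dtype=float)

p = 1.0 / (1.0 + np.exp(-eta))   # sigmoid: predicted probabilities

# Form 1: logit form, sum of log(1 + exp(eta)) - y * eta.
nll_logit = np.sum(np.logaddexp(0.0, eta) - y * eta)

# Form 2: binary cross-entropy on probabilities.
nll_bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

assert np.isclose(nll_logit, nll_bce)
```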

To solve for \(\hat{a}\) using gradient descent or Newton-Raphson, we need the gradient of the NLL:

\[ \frac{\partial \text{NLL}}{\partial a_j} = \sum_{i=1}^{n} (p_i - y_i) x_{ij} \]

In vector form:

\[ \nabla_a \text{NLL} = X^T (p - y) \]

where:

  • \(X\) is the \(n \times p\) design matrix
  • \(p = (p_1, \ldots, p_n)^T\) is the vector of predicted probabilities
  • \(y = (y_1, \ldots, y_n)^T\) is the vector of observed responses

Newton-Raphson additionally requires the Hessian:

\[ H = \frac{\partial^2 \text{NLL}}{\partial a \partial a^T} = X^T W X \]

where \(W\) is a diagonal matrix with:

\[ W_{ii} = p_i(1 - p_i) \]

The iterative update rule is:

\[ a^{(t+1)} = a^{(t)} - (X^T W X)^{-1} X^T (p - y) \]

This is also known as Iteratively Reweighted Least Squares (IRLS) because the weights \(W\) change at each iteration.
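The IRLS update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the `ridge` term is a small assumed regularizer added only to keep \(X^T W X\) invertible, and the fixed iteration count stands in for a proper convergence check.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, ridge=1e-8):
    """Fit logistic regression by Newton-Raphson / IRLS (minimal sketch)."""
    a = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ a
        p = 1.0 / (1.0 + np.exp(-eta))
        W = p * (1 - p)                    # diagonal entries of W
        grad = X.T @ (p - y)               # gradient of the NLL
        H = X.T @ (W[:, None] * X)         # Hessian X^T W X
        H += ridge * np.eye(X.shape[1])    # tiny ridge for numerical safety
        a -= np.linalg.solve(H, grad)      # Newton step
    return a

# Simulated data with assumed true coefficients, for illustration.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
a_true = np.array([0.5, -1.0])
p_true = 1.0 / (1.0 + np.exp(-(X @ a_true)))
y = (rng.uniform(size=n) < p_true).astype(float)

a_hat = irls_logistic(X, y)
```

Because the weights \(W\) are recomputed from the current \(p_i\) at every step, each iteration solves a freshly reweighted least-squares problem, hence the name.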


We now derive the distributional properties of the OLS estimator in more detail. Starting from the normal equation:

\[ \hat{a} = (X^T X)^{-1} X^T Y \]

Substituting \(Y = Xa + E\):

\[ \hat{a} = (X^T X)^{-1} X^T (Xa + E) = a + (X^T X)^{-1} X^T E \]
\[ E(\hat{a}) = E\left[ a + (X^T X)^{-1} X^T E \right] = a + (X^T X)^{-1} X^T E(E) = a \]

(Since \(E(E) = 0\))

Thus, \(\hat{a}\) is unbiased.

For the variance, using \(\hat{a} - a = (X^T X)^{-1} X^T E\):

\[ (\hat{a} - a)(\hat{a} - a)^T = [(X^T X)^{-1} X^T E][(X^T X)^{-1} X^T E]^T \]
\[ = (X^T X)^{-1} X^T E E^T X (X^T X)^{-1} \]

Taking expectation:

\[ E[(\hat{a} - a)(\hat{a} - a)^T] = (X^T X)^{-1} X^T E(E E^T) X (X^T X)^{-1} \]

Under the assumption \(E(E E^T) = \sigma^2 I_n\):

\[ = (X^T X)^{-1} X^T (\sigma^2 I_n) X (X^T X)^{-1} \]
\[ = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} \]
\[ = \sigma^2 I_p (X^T X)^{-1} \]
\[ \boxed{\text{Var}(\hat{a}) = \sigma^2 (X^T X)^{-1}} \]

If \(E \sim \mathcal{N}(0, \sigma^2 I)\), then:

\[ \boxed{\hat{a} \sim \mathcal{N}\left(a, \sigma^2 (X^T X)^{-1}\right)} \]
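A Monte Carlo sketch makes this sampling distribution concrete: holding the design \(X\) fixed and redrawing the noise many times, the empirical mean and covariance of the resulting \(\hat{a}\)'s should approach \(a\) and \(\sigma^2 (X^T X)^{-1}\). The dataset sizes and coefficients below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n, sigma = 40, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design
a_true = np.array([1.0, 2.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Redraw the noise and re-estimate a_hat many times.
reps = 5000
a_hats = np.empty((reps, 2))
for r in range(reps):
    Y = X @ a_true + sigma * rng.normal(size=n)
    a_hats[r] = np.linalg.solve(X.T @ X, X.T @ Y)

emp_mean = a_hats.mean(axis=0)   # close to a_true (unbiasedness)
emp_cov = np.cov(a_hats.T)       # close to sigma^2 * (X^T X)^{-1}
```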

This result is fundamental for:

  • Hypothesis testing (t-tests for individual coefficients)
  • Confidence intervals
  • F-tests for nested models

These annexes provide the mathematical foundations for:

  1. OLS estimators for simple linear regression
  2. Normal equation for multiple regression
  3. Maximum likelihood estimation for logistic regression
  4. Distributional properties of coefficient estimators

Understanding these derivations helps build intuition for:

  • Why OLS works
  • When normal equations fail (singular \(X^T X\))
  • Why logistic regression requires iterative optimization
  • The role of normality assumptions in inference

Recommended Practice: Work through these derivations by hand to solidify your understanding of the mathematical foundations of linear models!