Mathematical Annexes

These annexes provide detailed mathematical derivations for the key results in linear models.


For simple linear regression:

\[ Y_i = a X_i + b + \varepsilon_i \]

We want to minimize the sum of squared residuals:

\[ \text{SSE}(a, b) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - (aX_i + b))^2 \]

Step 1: Expand the objective function

\[ \text{SSE}(a, b) = \sum_{i=1}^{n} (Y_i - aX_i - b)^2 \]

Step 2: Take partial derivatives

Partial derivative with respect to \(a\):

\[ \frac{\partial \text{SSE}}{\partial a} = \sum_{i=1}^{n} 2(Y_i - aX_i - b)(-X_i) = -2 \sum_{i=1}^{n} X_i(Y_i - aX_i - b) \]
\[ = -2 \left[ \sum_{i=1}^{n} X_i Y_i - a \sum_{i=1}^{n} X_i^2 - b \sum_{i=1}^{n} X_i \right] \]

Partial derivative with respect to \(b\):

\[ \frac{\partial \text{SSE}}{\partial b} = \sum_{i=1}^{n} 2(Y_i - aX_i - b)(-1) = -2 \sum_{i=1}^{n} (Y_i - aX_i - b) \]
\[ = -2 \left[ \sum_{i=1}^{n} Y_i - a \sum_{i=1}^{n} X_i - nb \right] \]

Step 3: Set partial derivatives to zero

Setting \(\frac{\partial \text{SSE}}{\partial b} = 0\):

\[ \sum_{i=1}^{n} Y_i - a \sum_{i=1}^{n} X_i - nb = 0 \]
\[ nb = \sum_{i=1}^{n} Y_i - a \sum_{i=1}^{n} X_i \]
\[ b = \frac{1}{n} \sum_{i=1}^{n} Y_i - a \cdot \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{Y} - a\bar{X} \]

This gives us:

\[ \boxed{\hat{b} = \bar{Y} - \hat{a}\bar{X}} \]

Setting \(\frac{\partial \text{SSE}}{\partial a} = 0\) and substituting \(b = \bar{Y} - a\bar{X}\):

\[ \sum_{i=1}^{n} X_i Y_i - a \sum_{i=1}^{n} X_i^2 - (\bar{Y} - a\bar{X}) \sum_{i=1}^{n} X_i = 0 \]
\[ \sum_{i=1}^{n} X_i Y_i - a \sum_{i=1}^{n} X_i^2 - \bar{Y} \sum_{i=1}^{n} X_i + a\bar{X} \sum_{i=1}^{n} X_i = 0 \]
\[ \sum_{i=1}^{n} X_i Y_i - \bar{Y} \sum_{i=1}^{n} X_i = a \left( \sum_{i=1}^{n} X_i^2 - \bar{X} \sum_{i=1}^{n} X_i \right) \]

Step 4: Simplify using centered forms

Note that \(\sum_{i=1}^{n} X_i = n\bar{X}\), so:

\[ \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = a \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right) \]

Recognizing covariance and variance:

\[ \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} \]
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \]

Therefore:

\[ \boxed{\hat{a} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}} \]
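As a quick numerical check of these closed-form estimators, the sketch below (using NumPy, on a small made-up dataset) computes \(\hat{a}\) and \(\hat{b}\) directly from the centered sums and compares them against NumPy's own least-squares fit:

```python
import numpy as np

# Small synthetic dataset (hypothetical values, chosen for illustration).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 4.2, 5.9, 8.1, 9.8])

Xbar, Ybar = X.mean(), Y.mean()

# Slope: centered cross-products over centered sum of squares.
a_hat = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
# Intercept: b_hat = Ybar - a_hat * Xbar.
b_hat = Ybar - a_hat * Xbar

# Cross-check against NumPy's degree-1 least-squares polynomial fit.
a_np, b_np = np.polyfit(X, Y, deg=1)
assert np.isclose(a_hat, a_np) and np.isclose(b_hat, b_np)
```

For this dataset the centered formulas give \(\hat{a} = 19.3 / 10 = 1.93\) and \(\hat{b} = 6.02 - 1.93 \times 3 = 0.23\), matching `np.polyfit`.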

For multiple regression in matrix form:

\[ Y = Xa + E \]

We want to minimize:

\[ \text{SSE}(a) = \|Y - Xa\|^2 = (Y - Xa)^T(Y - Xa) \]

Step 1: Expand the objective function

\[ \text{SSE}(a) = (Y - Xa)^T(Y - Xa) \]
\[ = Y^T Y - Y^T Xa - a^T X^T Y + a^T X^T X a \]

Since \(Y^T Xa\) is a scalar, it equals its transpose \(a^T X^T Y\), so:

\[ = Y^T Y - 2a^T X^T Y + a^T X^T X a \]

Step 2: Take the derivative with respect to \(a\)

Using matrix calculus rules:

  • \(\frac{\partial}{\partial a}(a^T c) = c\) for constant vector \(c\)
  • \(\frac{\partial}{\partial a}(a^T A a) = 2Aa\) for symmetric matrix \(A\)

\[ \frac{\partial \text{SSE}}{\partial a} = -2X^T Y + 2X^T X a \]

Step 3: Set derivative to zero

\[ -2X^T Y + 2X^T X a = 0 \]
\[ X^T X a = X^T Y \]

Step 4: Solve for \(a\)

Assuming \(X^T X\) is invertible (i.e., \(X\) has full column rank):

\[ \boxed{\hat{a} = (X^T X)^{-1} X^T Y} \]
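A minimal NumPy sketch of the normal equation, on simulated data with an assumed true coefficient vector: in practice one solves the linear system \(X^T X a = X^T Y\) rather than forming the inverse explicitly, and the result can be cross-checked against NumPy's least-squares routine.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 3
# Design matrix with an intercept column; remaining columns are random.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
a_true = np.array([1.0, 2.0, -0.5])   # assumed true coefficients
Y = X @ a_true + 0.1 * rng.normal(size=n)

# Normal equation: solve X^T X a = X^T Y (avoids explicit matrix inversion).
a_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Cross-check against NumPy's least-squares solver.
a_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
assert np.allclose(a_hat, a_lstsq)
```

With low noise, `a_hat` lands close to `a_true`, as the unbiasedness result below suggests it should.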

Unbiasedness:

\[ E(\hat{a}) = E\left[ (X^T X)^{-1} X^T Y \right] = E\left[ (X^T X)^{-1} X^T (Xa + E) \right] \]
\[ = (X^T X)^{-1} X^T X a + (X^T X)^{-1} X^T E(E) = a + 0 = a \]

Variance:

\[ \text{Var}(\hat{a}) = \text{Var}\left[ (X^T X)^{-1} X^T Y \right] \]

Since \(Y = Xa + E\):

\[ \hat{a} = (X^T X)^{-1} X^T (Xa + E) = a + (X^T X)^{-1} X^T E \]
\[ \hat{a} - a = (X^T X)^{-1} X^T E \]
\[ \text{Var}(\hat{a}) = E\left[ (\hat{a} - a)(\hat{a} - a)^T \right] \]
\[ = E\left[ (X^T X)^{-1} X^T E E^T X (X^T X)^{-1} \right] \]

Under the assumption \(E(E E^T) = \sigma^2 I\):

\[ = (X^T X)^{-1} X^T (\sigma^2 I) X (X^T X)^{-1} \]
\[ = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} \]
\[ \boxed{\text{Var}(\hat{a}) = \sigma^2 (X^T X)^{-1}} \]

For logistic regression with binary response \(Y_i \in \{0, 1\}\):

\[ P(Y_i = 1 \mid X_i) = p_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} = \frac{1}{1 + \exp(-\eta_i)} \]

where \(\eta_i = x_i^T a\).

We want to find the maximum likelihood estimator \(\hat{a}\).

For a single observation:

\[ P(Y_i = y_i \mid X_i) = p_i^{y_i} (1 - p_i)^{1 - y_i} \]

For \(n\) independent observations:

\[ L(a) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} \]

Taking logarithms:

\[ \ell(a) = \log L(a) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \]

Step 1: Express in terms of \(\eta_i\)

\[ p_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} = \frac{1}{1 + \exp(-\eta_i)} \]
\[ 1 - p_i = \frac{1}{1 + \exp(\eta_i)} \]
\[ \log(p_i) = \log\left(\frac{\exp(\eta_i)}{1 + \exp(\eta_i)}\right) = \eta_i - \log(1 + \exp(\eta_i)) \]
\[ \log(1 - p_i) = \log\left(\frac{1}{1 + \exp(\eta_i)}\right) = -\log(1 + \exp(\eta_i)) \]

Step 2: Substitute into log-likelihood

\[ \ell(a) = \sum_{i=1}^{n} \left[ y_i (\eta_i - \log(1 + \exp(\eta_i))) + (1 - y_i)(-\log(1 + \exp(\eta_i))) \right] \]
\[ = \sum_{i=1}^{n} \left[ y_i \eta_i - y_i \log(1 + \exp(\eta_i)) - \log(1 + \exp(\eta_i)) + y_i \log(1 + \exp(\eta_i)) \right] \]
\[ = \sum_{i=1}^{n} \left[ y_i \eta_i - \log(1 + \exp(\eta_i)) \right] \]
\[ \boxed{\ell(a) = \sum_{i=1}^{n} \left[ y_i x_i^T a - \log(1 + \exp(x_i^T a)) \right]} \]

To obtain a loss to minimize (equivalent to maximizing \(\ell(a)\)), we negate the log-likelihood:

\[ \boxed{\text{NLL}(a) = -\ell(a) = \sum_{i=1}^{n} \left[ \log(1 + \exp(x_i^T a)) - y_i x_i^T a \right]} \]

Alternatively, in the original probability form:

\[ \boxed{\text{NLL}(a) = -\sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]} \]

This is the binary cross-entropy loss.
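The two boxed forms of the NLL are algebraically identical, which the NumPy sketch below verifies on hypothetical logits and labels. Note that `np.logaddexp(0, eta)` computes \(\log(1 + \exp(\eta))\) without overflow for large \(\eta\), which is why the logit form is usually preferred in practice:

```python
import numpy as np

# Hypothetical logits eta_i = x_i^T a and binary labels, for illustration.
eta = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
y = np.array([0, 0, 1, 1, 1], dtype=float)

p = 1.0 / (1.0 + np.exp(-eta))   # sigmoid: predicted probabilities

# Form 1: logit form, sum of log(1 + exp(eta)) - y * eta.
nll_logit = np.sum(np.logaddexp(0.0, eta) - y * eta)

# Form 2: binary cross-entropy on probabilities.
nll_bce = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

assert np.isclose(nll_logit, nll_bce)
```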

To solve for \(\hat{a}\) using gradient descent or Newton-Raphson, we need the gradient of the NLL:

\[ \frac{\partial \text{NLL}}{\partial a_j} = \sum_{i=1}^{n} (p_i - y_i) x_{ij} \]

In vector form:

\[ \nabla_a \text{NLL} = X^T (p - y) \]

where:

  • \(X\) is the \(n \times p\) design matrix
  • \(p = (p_1, \ldots, p_n)^T\) is the vector of predicted probabilities
  • \(y = (y_1, \ldots, y_n)^T\) is the vector of observed responses

Newton-Raphson additionally requires the Hessian:

\[ H = \frac{\partial^2 \text{NLL}}{\partial a \partial a^T} = X^T W X \]

where \(W\) is a diagonal matrix with:

\[ W_{ii} = p_i(1 - p_i) \]

The iterative update rule is:

\[ a^{(t+1)} = a^{(t)} - (X^T W X)^{-1} X^T (p - y) \]

This is also known as Iteratively Reweighted Least Squares (IRLS) because the weights \(W\) change at each iteration.
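The IRLS update above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation: the `ridge` term is a small assumed regularizer added only to keep \(X^T W X\) invertible, and the fixed iteration count stands in for a proper convergence check.

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, ridge=1e-8):
    """Fit logistic regression by Newton-Raphson / IRLS (minimal sketch)."""
    a = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ a
        p = 1.0 / (1.0 + np.exp(-eta))
        W = p * (1 - p)                    # diagonal entries of W
        grad = X.T @ (p - y)               # gradient of the NLL
        H = X.T @ (W[:, None] * X)         # Hessian X^T W X
        H += ridge * np.eye(X.shape[1])    # tiny ridge for numerical safety
        a -= np.linalg.solve(H, grad)      # Newton step
    return a

# Simulated data with assumed true coefficients, for illustration.
rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])
a_true = np.array([0.5, -1.0])
p_true = 1.0 / (1.0 + np.exp(-(X @ a_true)))
y = (rng.uniform(size=n) < p_true).astype(float)

a_hat = irls_logistic(X, y)
```

Because the weights \(W\) are recomputed from the current \(p_i\) at every step, each iteration solves a freshly reweighted least-squares problem, hence the name.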


We now derive the distributional properties of the OLS estimator in more detail. Starting from the normal equation:

\[ \hat{a} = (X^T X)^{-1} X^T Y \]

Substituting \(Y = Xa + E\):

\[ \hat{a} = (X^T X)^{-1} X^T (Xa + E) = a + (X^T X)^{-1} X^T E \]
\[ E(\hat{a}) = E\left[ a + (X^T X)^{-1} X^T E \right] = a + (X^T X)^{-1} X^T E(E) = a \]

(Since \(E(E) = 0\))

Thus, \(\hat{a}\) is unbiased.

For the variance, using \(\hat{a} - a = (X^T X)^{-1} X^T E\):

\[ (\hat{a} - a)(\hat{a} - a)^T = [(X^T X)^{-1} X^T E][(X^T X)^{-1} X^T E]^T \]
\[ = (X^T X)^{-1} X^T E E^T X (X^T X)^{-1} \]

Taking expectation:

\[ E[(\hat{a} - a)(\hat{a} - a)^T] = (X^T X)^{-1} X^T E(E E^T) X (X^T X)^{-1} \]

Under the assumption \(E(E E^T) = \sigma^2 I_n\):

\[ = (X^T X)^{-1} X^T (\sigma^2 I_n) X (X^T X)^{-1} \]
\[ = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} \]
\[ = \sigma^2 I_p (X^T X)^{-1} \]
\[ \boxed{\text{Var}(\hat{a}) = \sigma^2 (X^T X)^{-1}} \]

If \(E \sim \mathcal{N}(0, \sigma^2 I)\), then:

\[ \boxed{\hat{a} \sim \mathcal{N}\left(a, \sigma^2 (X^T X)^{-1}\right)} \]
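A Monte Carlo sketch makes this sampling distribution concrete: holding the design \(X\) fixed and redrawing the noise many times, the empirical mean and covariance of the resulting \(\hat{a}\)'s should approach \(a\) and \(\sigma^2 (X^T X)^{-1}\). The dataset sizes and coefficients below are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

n, sigma = 40, 0.5
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # fixed design
a_true = np.array([1.0, 2.0])
XtX_inv = np.linalg.inv(X.T @ X)

# Redraw the noise and re-estimate a_hat many times.
reps = 5000
a_hats = np.empty((reps, 2))
for r in range(reps):
    Y = X @ a_true + sigma * rng.normal(size=n)
    a_hats[r] = np.linalg.solve(X.T @ X, X.T @ Y)

emp_mean = a_hats.mean(axis=0)   # close to a_true (unbiasedness)
emp_cov = np.cov(a_hats.T)       # close to sigma^2 * (X^T X)^{-1}
```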

This result is fundamental for:

  • Hypothesis testing (t-tests for individual coefficients)
  • Confidence intervals
  • F-tests for nested models

These annexes provide the mathematical foundations for:

  1. OLS estimators for simple linear regression
  2. Normal equation for multiple regression
  3. Maximum likelihood estimation for logistic regression
  4. Distributional properties of coefficient estimators

Understanding these derivations helps build intuition for:

  • Why OLS works
  • When normal equations fail (singular \(X^T X\))
  • Why logistic regression requires iterative optimization
  • The role of normality assumptions in inference

Recommended Practice: Work through these derivations by hand to solidify your understanding of the mathematical foundations of linear models!