Mathematical Annexes
These annexes provide detailed mathematical derivations for the key results in linear models.
Annex A: OLS Solution for Simple Linear Regression
Problem Statement
For simple linear regression:
\[ Y_i = a X_i + b + \varepsilon_i \]
We want to minimize the sum of squared residuals:
\[ \text{SSE}(a, b) = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - (aX_i + b))^2 \]
Derivation
Step 1: Expand the objective function
\[ \text{SSE}(a, b) = \sum_{i=1}^{n} (Y_i - aX_i - b)^2 \]
Step 2: Take partial derivatives
Partial derivative with respect to \(a\):
\[ \frac{\partial \text{SSE}}{\partial a} = \sum_{i=1}^{n} 2(Y_i - aX_i - b)(-X_i) = -2 \sum_{i=1}^{n} X_i(Y_i - aX_i - b) \]
\[ = -2 \left[ \sum_{i=1}^{n} X_i Y_i - a \sum_{i=1}^{n} X_i^2 - b \sum_{i=1}^{n} X_i \right] \]
Partial derivative with respect to \(b\):
\[ \frac{\partial \text{SSE}}{\partial b} = \sum_{i=1}^{n} 2(Y_i - aX_i - b)(-1) = -2 \sum_{i=1}^{n} (Y_i - aX_i - b) \]
\[ = -2 \left[ \sum_{i=1}^{n} Y_i - a \sum_{i=1}^{n} X_i - nb \right] \]
Step 3: Set partial derivatives to zero
Setting \(\frac{\partial \text{SSE}}{\partial b} = 0\):
\[ \sum_{i=1}^{n} Y_i - a \sum_{i=1}^{n} X_i - nb = 0 \]
\[ nb = \sum_{i=1}^{n} Y_i - a \sum_{i=1}^{n} X_i \]
\[ b = \frac{1}{n} \sum_{i=1}^{n} Y_i - a \cdot \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{Y} - a\bar{X} \]
This gives us:
\[ \boxed{\hat{b} = \bar{Y} - \hat{a}\bar{X}} \]
Setting \(\frac{\partial \text{SSE}}{\partial a} = 0\) and substituting \(b = \bar{Y} - a\bar{X}\):
\[ \sum_{i=1}^{n} X_i Y_i - a \sum_{i=1}^{n} X_i^2 - (\bar{Y} - a\bar{X}) \sum_{i=1}^{n} X_i = 0 \]
\[ \sum_{i=1}^{n} X_i Y_i - a \sum_{i=1}^{n} X_i^2 - \bar{Y} \sum_{i=1}^{n} X_i + a\bar{X} \sum_{i=1}^{n} X_i = 0 \]
\[ \sum_{i=1}^{n} X_i Y_i - \bar{Y} \sum_{i=1}^{n} X_i = a \left( \sum_{i=1}^{n} X_i^2 - \bar{X} \sum_{i=1}^{n} X_i \right) \]
Step 4: Simplify using centered forms
Note that \(\sum_{i=1}^{n} X_i = n\bar{X}\), so:
\[ \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} = a \left( \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \right) \]
Recognizing covariance and variance:
\[ \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}) = \sum_{i=1}^{n} X_i Y_i - n\bar{X}\bar{Y} \]
\[ \sum_{i=1}^{n} (X_i - \bar{X})^2 = \sum_{i=1}^{n} X_i^2 - n\bar{X}^2 \]
Therefore:
\[ \boxed{\hat{a} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\text{Cov}(X, Y)}{\text{Var}(X)}} \]
Annex B: Normal Equation for Multiple Regression
Problem Statement
For multiple regression in matrix form:
\[ Y = Xa + E \]
We want to minimize:
\[ \text{SSE}(a) = \|Y - Xa\|^2 = (Y - Xa)^T(Y - Xa) \]
Derivation
Step 1: Expand the objective function
\[ \text{SSE}(a) = (Y - Xa)^T(Y - Xa) \]
\[ = Y^T Y - Y^T Xa - a^T X^T Y + a^T X^T X a \]
Since \(Y^T Xa\) and \(a^T X^T Y\) are scalars and equal:
\[ = Y^T Y - 2a^T X^T Y + a^T X^T X a \]
Step 2: Take the derivative with respect to \(a\)
Using matrix calculus rules:
- \(\frac{\partial}{\partial a}(a^T c) = c\) for constant vector \(c\)
- \(\frac{\partial}{\partial a}(a^T A a) = 2Aa\) for symmetric matrix \(A\)
Step 3: Set derivative to zero
\[ -2X^T Y + 2X^T X a = 0 \]
\[ X^T X a = X^T Y \]
Step 4: Solve for \(a\)
Assuming \(X^T X\) is invertible (i.e., \(X\) has full column rank):
\[ \boxed{\hat{a} = (X^T X)^{-1} X^T Y} \]
Properties of the Normal Equation
Unbiasedness:
\[ E(\hat{a}) = E\left[ (X^T X)^{-1} X^T Y \right] = E\left[ (X^T X)^{-1} X^T (Xa + E) \right] \]
\[ = (X^T X)^{-1} X^T X a + (X^T X)^{-1} X^T E(E) = a + 0 = a \]
Variance:
\[ \text{Var}(\hat{a}) = \text{Var}\left[ (X^T X)^{-1} X^T Y \right] \]
Since \(Y = Xa + E\):
\[ \hat{a} = (X^T X)^{-1} X^T (Xa + E) = a + (X^T X)^{-1} X^T E \]
\[ \hat{a} - a = (X^T X)^{-1} X^T E \]
\[ \text{Var}(\hat{a}) = E\left[ (\hat{a} - a)(\hat{a} - a)^T \right] \]
\[ = E\left[ (X^T X)^{-1} X^T E E^T X (X^T X)^{-1} \right] \]
Under the assumption \(E(E E^T) = \sigma^2 I\):
\[ = (X^T X)^{-1} X^T (\sigma^2 I) X (X^T X)^{-1} \]
\[ = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} \]
\[ \boxed{\text{Var}(\hat{a}) = \sigma^2 (X^T X)^{-1}} \]
Annex C: Maximum Likelihood for Logistic Regression
Problem Statement
For logistic regression with binary response \(Y_i \in \{0, 1\}\):
\[ P(Y_i = 1 \mid X_i) = p_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} = \frac{1}{1 + \exp(-\eta_i)} \]
where \(\eta_i = x_i^T a\).
We want to find the maximum likelihood estimator \(\hat{a}\).
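The two equivalent forms of the logistic function above can be checked numerically. A minimal sketch in Python (NumPy assumed available; the name `sigmoid` is ours, not from the text):

```python
import numpy as np

def sigmoid(eta):
    """Logistic function 1 / (1 + exp(-eta)), computed stably.

    The two algebraic forms in the text are equal; branching on the sign
    of eta ensures exp() is only ever called on non-positive arguments,
    so it cannot overflow for large |eta|.
    """
    eta = np.asarray(eta, dtype=float)
    out = np.empty_like(eta)
    pos = eta >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-eta[pos]))
    out[~pos] = np.exp(eta[~pos]) / (1.0 + np.exp(eta[~pos]))
    return out
```

On moderate values either textbook form works directly; the sign-split version only matters when \(|\eta_i|\) is large.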
Likelihood Function
For a single observation:
\[ P(Y_i = y_i \mid X_i) = p_i^{y_i} (1 - p_i)^{1 - y_i} \]
For \(n\) independent observations:
\[ L(a) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} \]
Log-Likelihood
Taking logarithms:
\[ \ell(a) = \log L(a) = \sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right] \]
Step 1: Express in terms of \(\eta_i\)
\[ p_i = \frac{\exp(\eta_i)}{1 + \exp(\eta_i)} = \frac{1}{1 + \exp(-\eta_i)} \]
\[ 1 - p_i = \frac{1}{1 + \exp(\eta_i)} \]
\[ \log(p_i) = \log\left(\frac{\exp(\eta_i)}{1 + \exp(\eta_i)}\right) = \eta_i - \log(1 + \exp(\eta_i)) \]
\[ \log(1 - p_i) = \log\left(\frac{1}{1 + \exp(\eta_i)}\right) = -\log(1 + \exp(\eta_i)) \]
Step 2: Substitute into log-likelihood
\[ \ell(a) = \sum_{i=1}^{n} \left[ y_i (\eta_i - \log(1 + \exp(\eta_i))) + (1 - y_i)(-\log(1 + \exp(\eta_i))) \right] \]
\[ = \sum_{i=1}^{n} \left[ y_i \eta_i - y_i \log(1 + \exp(\eta_i)) - \log(1 + \exp(\eta_i)) + y_i \log(1 + \exp(\eta_i)) \right] \]
\[ = \sum_{i=1}^{n} \left[ y_i \eta_i - \log(1 + \exp(\eta_i)) \right] \]
\[ \boxed{\ell(a) = \sum_{i=1}^{n} \left[ y_i x_i^T a - \log(1 + \exp(x_i^T a)) \right]} \]
Negative Log-Likelihood (Loss Function)
To minimize (equivalent to maximizing \(\ell(a)\)):
\[ \boxed{\text{NLL}(a) = -\ell(a) = \sum_{i=1}^{n} \left[ \log(1 + \exp(x_i^T a)) - y_i x_i^T a \right]} \]
Alternatively, in the original probability form:
\[ \boxed{\text{NLL}(a) = -\sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]} \]This is the binary cross-entropy loss.
Gradient of the Negative Log-Likelihood
To solve using gradient descent or Newton-Raphson, we need:
\[ \frac{\partial \text{NLL}}{\partial a_j} = \sum_{i=1}^{n} (p_i - y_i) x_{ij} \]
In vector form:
\[ \nabla_a \text{NLL} = X^T (p - y) \]
where:
- \(X\) is the \(n \times p\) design matrix
- \(p = (p_1, \ldots, p_n)^T\) is the vector of predicted probabilities
- \(y = (y_1, \ldots, y_n)^T\) is the vector of observed responses
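The closed-form gradient \(X^T(p - y)\) can be sanity-checked against central finite differences. A sketch (NumPy assumed; names are illustrative):

```python
import numpy as np

def nll(a, X, y):
    """NLL in the overflow-safe linear-predictor form."""
    eta = X @ a
    return np.sum(np.logaddexp(0.0, eta) - y * eta)

def nll_gradient(a, X, y):
    """Analytic gradient from the derivation above: X^T (p - y)."""
    p = 1.0 / (1.0 + np.exp(-(X @ a)))
    return X.T @ (p - y)

def finite_diff_gradient(f, a, h=1e-6):
    """Central differences, one coordinate at a time."""
    g = np.zeros_like(a)
    for j in range(a.size):
        step = np.zeros_like(a)
        step[j] = h
        g[j] = (f(a + step) - f(a - step)) / (2.0 * h)
    return g
```

The two gradients should agree to several decimal places at any test point; this is a standard check before handing a gradient to an optimizer.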
Hessian (Second Derivative Matrix)
For Newton-Raphson optimization:
\[ H = \frac{\partial^2 \text{NLL}}{\partial a \partial a^T} = X^T W X \]
where \(W\) is a diagonal matrix with:
\[ W_{ii} = p_i(1 - p_i) \]
Newton-Raphson Update
The iterative update rule is:
\[ a^{(t+1)} = a^{(t)} - (X^T W X)^{-1} X^T (p - y) \]This is also known as Iteratively Reweighted Least Squares (IRLS) because the weights \(W\) change at each iteration.
Annex D: Derivation of Coefficient Variance
Distribution of the Estimator
Starting from the normal equation:
\[ \hat{a} = (X^T X)^{-1} X^T Y \]
Substituting \(Y = Xa + E\):
\[ \hat{a} = (X^T X)^{-1} X^T (Xa + E) = a + (X^T X)^{-1} X^T E \]
Expected Value
\[ E(\hat{a}) = E\left[ a + (X^T X)^{-1} X^T E \right] = a + (X^T X)^{-1} X^T E(E) = a \]
(Since \(E(E) = 0\))
Thus, \(\hat{a}\) is unbiased.
Variance-Covariance Matrix
\[ (\hat{a} - a)(\hat{a} - a)^T = [(X^T X)^{-1} X^T E][(X^T X)^{-1} X^T E]^T \]
\[ = (X^T X)^{-1} X^T E E^T X (X^T X)^{-1} \]
Taking expectation:
\[ E[(\hat{a} - a)(\hat{a} - a)^T] = (X^T X)^{-1} X^T E(E E^T) X (X^T X)^{-1} \]
Under the assumption \(E(E E^T) = \sigma^2 I_n\):
\[ = (X^T X)^{-1} X^T (\sigma^2 I_n) X (X^T X)^{-1} \]
\[ = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} \]
\[ = \sigma^2 I_p (X^T X)^{-1} \]
\[ \boxed{\text{Var}(\hat{a}) = \sigma^2 (X^T X)^{-1}} \]
Distribution Under Normality
If \(E \sim \mathcal{N}(0, \sigma^2 I)\), then:
\[ \boxed{\hat{a} \sim \mathcal{N}\left(a, \sigma^2 (X^T X)^{-1}\right)} \]
This result is fundamental for:
- Hypothesis testing (t-tests for individual coefficients)
- Confidence intervals
- F-tests for nested models
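Both the unbiasedness and the variance formula can be checked by Monte Carlo simulation: fix the design \(X\), redraw the errors \(E\) many times, and compare the empirical mean and covariance of \(\hat{a}\) with the theory. A sketch with illustrative values of \(a\) and \(\sigma\):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.5
a_true = np.array([2.0, -1.0])
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed design
XtX_inv = np.linalg.inv(X.T @ X)

# Redraw the error vector E and re-estimate a-hat for each replication.
draws = np.array([
    XtX_inv @ X.T @ (X @ a_true + sigma * rng.normal(size=n))
    for _ in range(5000)
])

emp_mean = draws.mean(axis=0)        # unbiasedness: should be near a_true
emp_cov = np.cov(draws.T)            # should be near sigma^2 (X^T X)^{-1}
theo_cov = sigma ** 2 * XtX_inv
```

With 5000 replications the empirical moments match the theoretical ones to within ordinary Monte Carlo noise.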
Summary
These annexes provide the mathematical foundations for:
- ✅ OLS estimators for simple linear regression
- ✅ Normal equation for multiple regression
- ✅ Maximum likelihood estimation for logistic regression
- ✅ Distributional properties of coefficient estimators
Understanding these derivations helps build intuition for:
- Why OLS works
- When normal equations fail (singular \(X^T X\))
- Why logistic regression requires iterative optimization
- The role of normality assumptions in inference
Recommended Practice: Work through these derivations by hand to solidify your understanding of the mathematical foundations of linear models!
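As a complement to working by hand, the Annex A closed forms can be verified against a library fit on synthetic data. A sketch (NumPy assumed; all data values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 10.0, size=50)
Y = 3.0 * X + 2.0 + rng.normal(scale=1.5, size=50)      # true a = 3, b = 2

# Annex A closed forms: a-hat = Cov(X, Y) / Var(X), b-hat = Ybar - a-hat * Xbar.
Xc, Yc = X - X.mean(), Y - Y.mean()
a_hat = np.sum(Xc * Yc) / np.sum(Xc ** 2)
b_hat = Y.mean() - a_hat * X.mean()

# Cross-check against NumPy's least-squares polynomial fit (degree 1 = a line).
a_ref, b_ref = np.polyfit(X, Y, deg=1)
```

The hand-derived estimators and `np.polyfit` agree to machine precision, since both minimize the same SSE.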