Simple Linear Regression
What is Linear Regression?
Linear regression is a prediction model that establishes a linear relationship between a target variable and a set of explanatory variables.
The case of one explanatory variable is called simple linear regression:
\[ \hat{Y} = aX + b \]
The Gauss–Markov Theorem
The Gauss–Markov theorem states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, provided that the errors in the linear regression model:
- Are uncorrelated
- Have equal variances (homoscedasticity)
- Have an expected value of zero
The full model with error term:
\[ Y = aX + b + \varepsilon \]
where \(\varepsilon\) represents the random error.
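To make the theorem's claim concrete, here is a small simulation sketch. It compares the standard closed-form OLS slope with another linear unbiased estimator, the slope of the line through the first and last points. The true parameters, the noise level, the fixed design, and the choice of the "endpoint" competitor are all assumptions made for illustration; the errors are drawn independently with mean zero and equal variance, so the three conditions hold.

```python
import random
import statistics

def ols_slope(xs, ys):
    """Closed-form OLS slope: cross-deviations over squared x-deviations."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def endpoint_slope(xs, ys):
    """A competing linear unbiased estimator: slope through first and last points."""
    return (ys[-1] - ys[0]) / (xs[-1] - xs[0])

random.seed(0)                            # fixed seed for reproducibility
TRUE_A, TRUE_B, SIGMA = 2.0, 1.0, 1.0     # assumed values, for illustration only
xs = [float(i) for i in range(10)]        # fixed design points

ols_estimates, endpoint_estimates = [], []
for _ in range(2000):
    # Independent, zero-mean, equal-variance errors (the Gauss-Markov conditions).
    ys = [TRUE_A * x + TRUE_B + random.gauss(0.0, SIGMA) for x in xs]
    ols_estimates.append(ols_slope(xs, ys))
    endpoint_estimates.append(endpoint_slope(xs, ys))

print("OLS:      mean", statistics.mean(ols_estimates),
      "variance", statistics.variance(ols_estimates))
print("Endpoint: mean", statistics.mean(endpoint_estimates),
      "variance", statistics.variance(endpoint_estimates))
```

Both estimators should average close to the true slope, but the OLS estimates should scatter less across replications, which is what the theorem predicts.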
Ordinary Least Squares (OLS)
Objective Function
We need to minimize the sum of squared residuals:
\[ \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - (aX_i + b))^2 \]
Solution
Setting the partial derivatives with respect to \(a\) and \(b\) to zero and solving the resulting pair of equations, we find:
\[ \hat{a} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \]
\[ \hat{b} = \bar{Y} - \hat{a}\,\bar{X} \]
where:
- \(\bar{X}\) and \(\bar{Y}\) are the sample means
- \(\text{Cov}(X,Y)\) is the covariance between \(X\) and \(Y\)
- \(\text{Var}(X)\) is the variance of \(X\)
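The closed-form solution can be sketched in a few lines of Python. The function name `fit_simple_ols` and the five data points are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def fit_simple_ols(xs, ys):
    """Return (a_hat, b_hat) for the line y = a*x + b fitted by OLS."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula; the 1/(n-1) factors
    # of Cov(X,Y) and Var(X) cancel, so they are omitted.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    a_hat = sxy / sxx
    b_hat = y_bar - a_hat * x_bar
    return a_hat, b_hat

# Illustrative data (made up for this example).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

a_hat, b_hat = fit_simple_ols(xs, ys)
print(a_hat, b_hat)  # approximately 0.6 and 2.2
```

Here the means are 3 and 4, the sum of cross-deviations is 6, and the sum of squared deviations of the x values is 10, so the slope is 6/10 and the intercept is 4 - 0.6 * 3.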
Covariance and Correlation
Empirical Covariance
The empirical covariance is defined as:
\[ \text{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) \]
Covariance signifies the direction of the linear relationship between two variables:
- Positive covariance: variables tend to move in the same direction
- Negative covariance: variables tend to move in opposite directions
- Note: \(\text{Cov}(X,X) = \text{Var}(X)\)
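A minimal sketch of the empirical covariance follows, using the \(n-1\) denominator from the definition. The helper name `sample_cov` and the data are illustrative assumptions; `statistics.variance` from the standard library is used only as a cross-check of the identity \(\text{Cov}(X,X) = \text{Var}(X)\).

```python
import statistics

def sample_cov(xs, ys):
    """Empirical covariance with the n-1 denominator."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Illustrative data (made up for this example).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

cov_xy = sample_cov(xs, ys)                    # positive: x and y rise together
cov_neg = sample_cov(xs, [-y for y in ys])     # flipping y flips the sign
var_x = sample_cov(xs, xs)                     # Cov(X, X) equals Var(X)

print(cov_xy, cov_neg, var_x)
print(statistics.variance(xs))                 # same value as var_x
```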
Correlation
Correlation is the covariance rescaled by the standard deviations of both variables:
\[ \text{Cor}(X,Y) = \frac{\text{Cov}(X,Y)}{s_X \cdot s_Y} \]
where \(s_X\) and \(s_Y\) are the standard deviations of \(X\) and \(Y\).
Properties:
- Correlation is dimensionless and ranges from \(-1\) to \(+1\)
- \(|\text{Cor}(X,Y)| = 1\) indicates a perfect linear relationship
- \(\text{Cor}(X,Y) = 0\) indicates no linear relationship (the variables may still be related nonlinearly)
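These properties can be checked numerically with a short sketch. The helper `sample_corr` and the three small datasets are assumptions made for illustration: one exact increasing line, one exact decreasing line, and a parabola on symmetric inputs, where the dependence is strong but not linear.

```python
import math

def sample_corr(xs, ys):
    """Sample correlation: covariance divided by both standard deviations.

    The 1/(n-1) factors in the covariance and the two variances cancel,
    so plain sums of deviations are enough.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

xs = [1.0, 2.0, 3.0, 4.0]
corr_up = sample_corr(xs, [3 * x + 1 for x in xs])    # exact increasing line
corr_down = sample_corr(xs, [-2 * x for x in xs])     # exact decreasing line

sym = [-2.0, -1.0, 0.0, 1.0, 2.0]
corr_parabola = sample_corr(sym, [x ** 2 for x in sym])  # nonlinear dependence

print(corr_up, corr_down, corr_parabola)
```

The first two values sit at the ends of the range, while the parabola gives a correlation of zero even though the second variable is fully determined by the first.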
Errors and Residuals
Statistical Error (Disturbance)
A statistical error \(\varepsilon_i\) is the amount by which an observation differs from its expected value:
\[ Y_i = aX_i + b + \varepsilon_i \quad \Rightarrow \quad \varepsilon_i = Y_i - (aX_i + b) \]
Residual
A residual \(r_i\) is the observable estimate of the unobservable statistical error:
\[ r_i = \hat{\varepsilon}_i = Y_i - (\hat{a}X_i + \hat{b}) = Y_i - \hat{Y}_i \]
Key difference:
- Errors \(\varepsilon_i\) involve the true (unknown) parameters \(a\) and \(b\)
- Residuals \(r_i\) involve the estimated parameters \(\hat{a}\) and \(\hat{b}\)
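The distinction is easiest to see in a simulation, where the true parameters are known because we choose them. The sketch below assumes hypothetical true values \(a = 2\), \(b = 1\) and standard normal errors; with real data, only the residuals could be computed.

```python
import random

random.seed(1)  # fixed seed so the run is reproducible

# Assumed "true" parameters; known here only because the data are simulated.
TRUE_A, TRUE_B = 2.0, 1.0

xs = [float(i) for i in range(20)]
errors = [random.gauss(0.0, 1.0) for _ in xs]   # unobservable in practice
ys = [TRUE_A * x + TRUE_B + e for x, e in zip(xs, errors)]

# Estimate the coefficients by OLS from the observed (x, y) pairs only.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
a_hat = sxy / sxx
b_hat = y_bar - a_hat * x_bar

# Residuals use the estimated line, errors use the true line.
residuals = [y - (a_hat * x + b_hat) for x, y in zip(xs, ys)]

sum_sq_errors = sum(e ** 2 for e in errors)
sum_sq_residuals = sum(r ** 2 for r in residuals)

print("sum of errors:   ", sum(errors))      # generally not zero
print("sum of residuals:", sum(residuals))   # zero up to rounding
print("squared errors vs squared residuals:", sum_sq_errors, sum_sq_residuals)
```

Two consequences of fitting with an intercept show up here: the residuals sum to zero up to floating-point rounding, and their sum of squares cannot exceed that of the true errors, because the fitted line minimizes the squared deviations over all lines, including the true one.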
Visualizing OLS Fit
The regression line minimizes the sum of squared vertical distances (residuals) from each point to the line.
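As a numerical counterpart to the picture, the following sketch checks that nudging the fitted slope or intercept in any direction increases the sum of squared residuals. The data, the helper `ssr`, and the perturbation size of 0.1 are illustrative assumptions.

```python
def ssr(a, b, xs, ys):
    """Sum of squared vertical distances from the points to the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))

# Illustrative data (made up for this example).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

# OLS coefficients from the closed-form solution.
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n
a_hat = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
b_hat = y_bar - a_hat * x_bar

ssr_min = ssr(a_hat, b_hat, xs, ys)

# Try every combination of small shifts in slope and intercept except (0, 0).
perturbed = [
    ssr(a_hat + da, b_hat + db, xs, ys)
    for da in (-0.1, 0.0, 0.1)
    for db in (-0.1, 0.0, 0.1)
    if (da, db) != (0.0, 0.0)
]

print("SSR at the OLS line:", ssr_min)
print("smallest SSR among perturbed lines:", min(perturbed))
```

Every perturbed line should give a strictly larger value than the OLS line, consistent with the fitted line being the minimizer.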
Summary
In this lesson we covered:
- ✅ The definition of simple linear regression
- ✅ The Gauss–Markov theorem and OLS properties
- ✅ How to derive coefficient estimates using calculus
- ✅ The relationship between covariance and correlation
- ✅ The distinction between errors and residuals
Next: We'll explore how to assess the quality of our fit, test hypotheses about coefficients, and diagnose potential problems through residual analysis.