Simple Linear Regression

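Linear regression is a predictive model that assumes a linear relationship between a target variable and one or more explanatory variables.

With a single explanatory variable \(X\), the model is called simple linear regression, where \(a\) is the slope, \(b\) is the intercept, and \(\hat{Y}\) is the predicted value of the target: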

\[ \hat{Y} = aX + b \]

The Gauss–Markov theorem states that the ordinary least squares (OLS) estimator has the lowest sampling variance within the class of linear unbiased estimators, provided the errors in the linear regression model:

  • Are uncorrelated with one another
  • Have equal, finite variances (homoscedasticity)
  • Have an expected value of zero

The full model, with the error term written explicitly, is:

\[ Y = aX + b + \varepsilon \]

where \(\varepsilon\) represents the random error.
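
To make the model and the Gauss–Markov statement concrete, the following is a minimal simulation sketch. The true parameters, the grid of \(X\) values, the noise level, and the use of normally distributed errors are all assumptions chosen for illustration (the theorem itself does not require normality). It compares the OLS slope with one other linear unbiased estimator, the slope of the line through the first and last points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true parameters and design; none of these come from the lesson.
a_true, b_true, sigma = 2.0, 1.0, 1.0
x = np.linspace(0.0, 10.0, 20)   # fixed explanatory values
n_sims = 5000

ols_slopes = np.empty(n_sims)
endpoint_slopes = np.empty(n_sims)

for k in range(n_sims):
    # Errors with zero mean and equal variance, drawn independently
    # (so they are uncorrelated).
    eps = rng.normal(0.0, sigma, size=x.size)
    y = a_true * x + b_true + eps

    # OLS slope: sum of cross-deviations over sum of squared deviations.
    dx = x - x.mean()
    ols_slopes[k] = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)

    # A competing linear unbiased estimator: slope through the two end points.
    endpoint_slopes[k] = (y[-1] - y[0]) / (x[-1] - x[0])

print("mean OLS slope:     ", ols_slopes.mean())
print("mean endpoint slope:", endpoint_slopes.mean())
print("var  OLS slope:     ", ols_slopes.var(ddof=1))
print("var  endpoint slope:", endpoint_slopes.var(ddof=1))
```

Both estimators should average close to the true slope, while the OLS slope should show the smaller variance across simulated samples, which is what the theorem predicts for this setup.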


OLS chooses \(a\) and \(b\) to minimize the sum of squared residuals:

\[ \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} (Y_i - (aX_i + b))^2 \]
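
As a worked step, write \(S(a, b)\) for the sum above (the name \(S\) is introduced here only for convenience). Its partial derivatives are:

\[ \frac{\partial S}{\partial a} = -2\sum_{i=1}^{n} X_i\,\bigl(Y_i - (aX_i + b)\bigr), \qquad \frac{\partial S}{\partial b} = -2\sum_{i=1}^{n} \bigl(Y_i - (aX_i + b)\bigr) \]

The condition \(\partial S / \partial b = 0\) rearranges to \(\sum_{i=1}^{n} Y_i = a\sum_{i=1}^{n} X_i + nb\), that is, \(\bar{Y} = a\bar{X} + b\) with \(\bar{X}\) and \(\bar{Y}\) the sample means, so the fitted line passes through the point of means. Substituting this expression for \(b\) into \(\partial S / \partial a = 0\) leaves a single linear equation in \(a\).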

Setting the partial derivatives with respect to \(a\) and \(b\) to zero and solving gives:

\[ \hat{a} = \frac{\text{Cov}(X,Y)}{\text{Var}(X)} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \]

\[ \hat{b} = \bar{Y} - \hat{a}\,\bar{X} \]

where:

  • \(\bar{X}\) and \(\bar{Y}\) are the sample means
  • \(\text{Cov}(X,Y)\) is the covariance between \(X\) and \(Y\)
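  • \(\text{Var}(X)\) is the sample variance of \(X\); it uses the same \(1/(n-1)\) normalisation as \(\text{Cov}(X,Y)\), so the factor cancels in the ratio and leaves the sums shown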
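
The closed-form estimates can be sketched in a few lines of NumPy. The function name `fit_simple_ols` and the five data points are made up for illustration; the cross-check uses `np.polyfit` with degree 1, which also performs a least-squares line fit.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (a_hat, b_hat) for the least-squares line y ≈ a_hat * x + b_hat."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()                             # X_i - X_bar
    dy = y - y.mean()                             # Y_i - Y_bar
    a_hat = np.sum(dx * dy) / np.sum(dx ** 2)     # slope
    b_hat = y.mean() - a_hat * x.mean()           # intercept
    return a_hat, b_hat

# Made-up illustrative data (not from the lesson).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

a_hat, b_hat = fit_simple_ols(x, y)
print("slope:", a_hat, "intercept:", b_hat)

# Cross-check: np.polyfit returns coefficients from highest degree down.
a_np, b_np = np.polyfit(x, y, 1)
print("polyfit:", a_np, b_np)
```

For these five points the slope works out to about 1.99 and the intercept to about 0.05, and the two approaches should agree up to floating-point rounding.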

The empirical covariance is defined as:

\[ \text{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}) \]

The sign of the covariance indicates the direction of the linear relationship between two variables:

  • Positive covariance: variables tend to move in the same direction
  • Negative covariance: variables tend to move in opposite directions
  • Note: \(\text{Cov}(X,X) = \text{Var}(X)\)
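
A small sketch of the empirical covariance, using made-up data and a hypothetical helper named `sample_cov`. It illustrates the sign behaviour and the identity \(\text{Cov}(X,X) = \text{Var}(X)\), and compares against `np.cov`, whose default normalisation is also \(1/(n-1)\).

```python
import numpy as np

def sample_cov(x, y):
    """Empirical covariance with the 1/(n-1) normalisation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum((x - x.mean()) * (y - y.mean())) / (x.size - 1)

# Made-up illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_up = np.array([2.0, 4.0, 5.0, 4.0, 6.0])     # tends to rise with x
y_down = np.array([6.0, 4.0, 5.0, 4.0, 2.0])   # tends to fall as x rises

cov_up = sample_cov(x, y_up)
cov_down = sample_cov(x, y_down)
cov_xx = sample_cov(x, x)

print("Cov(x, y_up):  ", cov_up)      # positive
print("Cov(x, y_down):", cov_down)    # negative
print("Cov(x, x):     ", cov_xx, "Var(x):", np.var(x, ddof=1))
print("np.cov check:  ", np.cov(x, y_up)[0, 1])
```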

Correlation is a standardized, scale-free form of covariance:

\[ \text{Cor}(X,Y) = \frac{\text{Cov}(X,Y)}{s_X \cdot s_Y} \]

where \(s_X\) and \(s_Y\) are the standard deviations of \(X\) and \(Y\).

Properties:

  • Correlation is dimensionless and ranges from \(-1\) to \(+1\)
  • \(|\text{Cor}(X,Y)| = 1\) indicates a perfect linear relationship
  • \(\text{Cor}(X,Y) = 0\) indicates no linear relationship (a nonlinear relationship may still exist)
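
These properties can be checked numerically. The sketch below uses a hypothetical helper `sample_cor` and made-up data: two exact lines, a symmetric parabola (a nonlinear dependence whose correlation with \(X\) is zero), and a rescaled copy of \(X\) to show that a change of units leaves the correlation unchanged.

```python
import numpy as np

def sample_cor(x, y):
    """Sample correlation: covariance divided by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (x.size - 1)
    return cov_xy / (np.std(x, ddof=1) * np.std(y, ddof=1))

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_line = sample_cor(x, 3.0 * x + 2.0)              # exact increasing line
r_neg = sample_cor(x, -0.5 * x + 1.0)              # exact decreasing line
r_parabola = sample_cor(x, (x - 3.0) ** 2)         # y depends on x, but not linearly
r_rescaled = sample_cor(100.0 * x, 3.0 * x + 2.0)  # change of units in x

print(r_line, r_neg, r_parabola, r_rescaled)
```

The first two values should be \(+1\) and \(-1\) up to rounding, the parabola gives zero even though \(Y\) is fully determined by \(X\), and the rescaled case matches the first.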

A statistical error \(\varepsilon_i\) is the amount by which an observation differs from its expected value:

\[ Y_i = aX_i + b + \varepsilon_i \quad \Rightarrow \quad \varepsilon_i = Y_i - (aX_i + b) \]

A residual \(r_i\) is the observable estimate of the unobservable statistical error:

\[ r_i = \hat{\varepsilon}_i = Y_i - (\hat{a}X_i + \hat{b}) = Y_i - \hat{Y}_i \]

Key difference:

  • Errors \(\varepsilon_i\) involve the true (unknown) parameters \(a\) and \(b\)
  • Residuals \(r_i\) involve the estimated parameters \(\hat{a}\) and \(\hat{b}\)
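
The distinction is easiest to see in a simulation, where the true parameters are known by construction. In the sketch below the true values, the noise level, and the normal error distribution are assumptions made for illustration; with real data the errors would not be observable.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical true parameters; known here only because the data are simulated.
a_true, b_true = 2.0, 1.0
x = np.linspace(0.0, 10.0, 30)
errors = rng.normal(0.0, 1.0, size=x.size)    # epsilon_i
y = a_true * x + b_true + errors

# OLS estimates from the closed-form formulas.
dx = x - x.mean()
a_hat = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
b_hat = y.mean() - a_hat * x.mean()

# Residuals use the estimated line, not the true one.
residuals = y - (a_hat * x + b_hat)

print("sum of errors:        ", errors.sum())
print("sum of residuals:     ", residuals.sum())          # zero up to rounding
print("sum of squared errors:", np.sum(errors ** 2))
print("sum of squared resid.:", np.sum(residuals ** 2))   # never larger than the line above
```

Residuals from a fit with an intercept sum to zero by construction, whereas the errors generally do not. The residual sum of squares cannot exceed the error sum of squares, because the fitted line minimizes that sum over all lines, including the true one.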

The fitted regression line minimizes the sum of the squared vertical distances from the points to the line; these signed vertical distances are the residuals.
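
As a quick numerical check of this minimizing property, the sketch below fits a line to made-up data and then nudges the slope and intercept. The helper `ssr` and the particular nudges are illustrative choices, not part of the lesson.

```python
import numpy as np

def ssr(a, b, x, y):
    """Sum of squared vertical distances from the points to the line a*x + b."""
    return np.sum((y - (a * x + b)) ** 2)

# Made-up illustrative data.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# OLS estimates from the closed-form formulas.
dx = x - x.mean()
a_hat = np.sum(dx * (y - y.mean())) / np.sum(dx ** 2)
b_hat = y.mean() - a_hat * x.mean()

best = ssr(a_hat, b_hat, x, y)

# Nudge slope and intercept; every nudged line should fit worse.
nudges = [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1), (0.05, -0.05)]
for da, db in nudges:
    increase = ssr(a_hat + da, b_hat + db, x, y) - best
    print(f"da={da:+.2f}, db={db:+.2f}: SSR increases by {increase:.4f}")
```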


In this lesson we covered:

  1. ✅ The definition of simple linear regression
  2. ✅ The Gauss–Markov theorem and OLS properties
  3. ✅ How to derive coefficient estimates using calculus
  4. ✅ The relationship between covariance and correlation
  5. ✅ The distinction between errors and residuals

Next: We'll explore how to assess the quality of our fit, test hypotheses about coefficients, and diagnose potential problems through residual analysis.