Introduction & Background

Welcome to Modeling with Linear & Generalized Linear Models! In this first module we will:

  1. Review the notation and basic statistical concepts that will recur throughout the course.
  2. Introduce our running motivating example: predicting house prices using regression.

By the end of this lesson you should feel comfortable with symbols like \(y\), \(\mathbf{x}\), \(\boldsymbol\beta\), and be ready to formulate and fit your first linear model.

Throughout the course we will use the following conventions:

Symbol                  Meaning
\(n\)                   Number of observations (data points)
\(p\)                   Number of predictors (features)
\(\mathbf{x}_i\)        The \(p\)-vector of features for observation \(i\)
\(y_i\)                 The response (target) for observation \(i\)
\(\mathbf{X}\)          The \(n\times p\) design matrix whose \(i\)th row is \(\mathbf{x}_i^\top\)
\(\boldsymbol\beta\)    The \(p\)-vector of unknown model coefficients
\(\varepsilon_i\)       Random error term for observation \(i\)
\(\bar x\)              Sample mean of the observed values of a variable \(x\)
\(s^2\)                 Sample variance of the observed values of a variable \(x\)
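
To make the notation concrete, here is a minimal sketch in plain Python that builds \(\mathbf{X}\) and the response vector from four rows of the house-sales sample shown later in this lesson. The variable names (`rows`, `X`, `y`, `x_1`) are illustrative choices, not part of the course materials, and only the three numeric features are used.

```python
# Four rows from this lesson's sample table: (sqft, bedrooms, age, price).
rows = [
    (1710, 3, 5, 208500),
    (1262, 3, 31, 181500),
    (1786, 3, 7, 223500),
    (1717, 3, 91, 140000),
]

# Design matrix X: one row x_i^T per observation, one column per predictor.
X = [list(r[:3]) for r in rows]
# Response vector: y_i is the sale price of home i.
y = [r[3] for r in rows]

n = len(X)      # number of observations
p = len(X[0])   # number of predictors

x_1 = X[0]      # feature vector for the first observation (i = 1)
print(n, p)     # 4 3
print(x_1, y[0])
```

Note that Python indexes from 0, so the mathematical \(\mathbf{x}_1\) is `X[0]`.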

Before fitting models, let’s recall some foundational ideas:

  • A random variable \(Y\) has a probability distribution, described by its cumulative distribution function \(F_Y\).
  • The expected value (mean) is
    \[ \mathbb{E}[Y] = \mu = \int y\,dF_Y(y) \]
  • The variance measures spread:
    \[ \mathrm{Var}(Y) = \mathbb{E}\bigl[(Y - \mu)^2\bigr] = \sigma^2. \]
  • The sample mean of \(y_1,\dots,y_n\): \[ \bar y = \frac1n\sum_{i=1}^n y_i. \]
  • The sample variance: \[ s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2. \]
  • Covariance of two random variables \(X\) and \(Y\): \[ \mathrm{Cov}(X,Y) = \mathbb{E}\bigl[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\bigr]. \]
  • Correlation (Pearson’s \(\rho\)) is the standardized covariance: \[ \rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}. \]
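
The sample versions of these quantities can be computed directly from data. The sketch below uses the size and price columns of the 10-home sample shown later in this lesson. The helper names are hypothetical, and the sample covariance (with an \(n-1\) denominator, matching the sample variance above) is the usual sample analogue of \(\mathrm{Cov}(X,Y)\) rather than a formula stated in this lesson.

```python
import math

# Size (sqft) and sale price for the 10 homes in this lesson's sample table.
size = [1710, 1262, 1786, 1717, 2198, 1362, 1694, 2090, 1774, 1077]
price = [208500, 181500, 223500, 140000, 250000,
         143000, 307000, 200000, 129900, 118000]

def sample_mean(v):
    """Arithmetic mean: (1/n) * sum of the values."""
    return sum(v) / len(v)

def sample_cov(u, v):
    """Sample covariance with the n - 1 denominator."""
    u_bar, v_bar = sample_mean(u), sample_mean(v)
    return sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v)) / (len(u) - 1)

def sample_var(v):
    """Sample variance s^2, i.e. the covariance of v with itself."""
    return sample_cov(v, v)

def sample_corr(u, v):
    """Sample correlation: covariance standardized by both standard deviations."""
    return sample_cov(u, v) / math.sqrt(sample_var(u) * sample_var(v))

print(sample_mean(size))    # 1667.0
print(sample_mean(price))   # 190140.0
print(sample_var(size))
print(sample_corr(size, price))
```

With only 10 observations these numbers are illustrative; they describe this small sample, not the full dataset.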

We will see how these quantities enter into formulas for parameter estimates, tests, and confidence intervals.
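
As a preview, stated here without derivation: for a model with a single predictor \(x\), the least-squares estimates can be written entirely in terms of sample quantities. Here \(s_x\) and \(s_y\) denote the sample standard deviations and \(r_{xy}\) the sample correlation; these three symbols are introduced only for this illustration.

\[ \hat\beta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^n (x_i - \bar x)^2} = r_{xy}\,\frac{s_y}{s_x}, \qquad \hat\beta_0 = \bar y - \hat\beta_1\,\bar x. \]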

To ground our discussion, we’ll work throughout the course with a real-world dataset of house sales. Our goal:

Predict the sale price \(y\) of a home from features such as:

  • Size (\(\text{sqft}\))
  • Number of bedrooms
  • Age of the home
  • Neighborhood (a categorical feature that will be encoded as dummy variables)

In notation:

\[ y_i \;=\; \beta_0 + \beta_1\,\text{sqft}_i + \beta_2\,\text{beds}_i + \beta_3\,\text{age}_i + \cdots + \varepsilon_i. \]
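
The following sketch shows how the right-hand side of this equation (without the error term) could be evaluated for one home, including a dummy encoding of the neighborhood. Every coefficient value below is hypothetical and chosen only for illustration; none is an estimate from the data. The function names and the choice of `CollgCr` as the baseline level are also assumptions.

```python
def dummy_encode(value, levels):
    """Return 0/1 indicators for every level except the first (the baseline)."""
    return [1 if value == lvl else 0 for lvl in levels[1:]]

# A subset of neighborhoods; the first entry serves as the baseline level.
levels = ["CollgCr", "Veenker", "Crawfor"]

# Hypothetical coefficients, for illustration only.
beta_0 = 50000.0                    # intercept
beta = [80.0, 5000.0, -500.0]       # sqft, beds, age
beta_nbhd = [10000.0, -15000.0]     # Veenker and Crawfor, relative to CollgCr

def linear_predictor(sqft, beds, age, nbhd):
    """Compute beta_0 plus the sum of coefficient * feature over all features."""
    x = [sqft, beds, age] + dummy_encode(nbhd, levels)
    coefs = beta + beta_nbhd
    return beta_0 + sum(b * xj for b, xj in zip(coefs, x))

print(dummy_encode("Veenker", levels))            # [1, 0]
print(dummy_encode("CollgCr", levels))            # [0, 0]
print(linear_predictor(1262, 3, 31, "Veenker"))   # 160460.0
```

Dropping one level as a baseline avoids an indicator column that duplicates the intercept; each dummy coefficient is then read as a shift relative to the baseline neighborhood.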

Here is a sample of 10 rows from the dataset:

Size (sqft)  Bedrooms  Age  Neighborhood  Price
1710         3         5    CollgCr       208500
1262         3         31   Veenker       181500
1786         3         7    CollgCr       223500
1717         3         91   Crawfor       140000
2198         4         8    NoRidge       250000
1362         1         16   Mitchel       143000
1694         3         3    Somerst       307000
2090         3         36   NWAmes        200000
1774         2         77   OldTown       129900
1077         2         69   BrkSide       118000

[Figure: scatter plot of Price (y) vs. Size (x)]

Later modules will show how to:

  • Estimate \(\boldsymbol\beta\) (Ordinary Least Squares)
  • Test hypotheses about feature effects (t-tests, F-tests)
  • Check model assumptions (residual diagnostics)
  • Extend to generalized linear models (e.g., logistic regression)

With our notation and basic statistics clear, we’re ready to dive into Simple Linear Regression in the next lesson!