Introduction & Background

Welcome to Modeling with Linear & Generalized Linear Models! In this first module we will:

  1. Review the notation and basic statistical concepts that will recur throughout the course.
  2. Introduce our running motivating example: predicting house prices using regression.

By the end of this lesson you should feel comfortable with symbols like \(Y\), \(X\), \(a\), and be ready to formulate and fit your first linear model.

Throughout the course we will use the following conventions:

| Symbol | Meaning |
|---|---|
| \(n\) | Number of observations (data points) |
| \(p\) | Number of predictors (features), including the intercept |
| \(X_i\) or \(x_i\) | Features (predictors) for observation \(i\) |
| \(Y_i\) or \(y_i\) | Response (target) for observation \(i\) |
| \(X\) | The \(n \times p\) design matrix |
| \(a\) or \(a_j\) | Model coefficients (intercept, slopes) |
| \(\varepsilon_i\) | Random error term for observation \(i\) |
| \(\bar{X}\), \(\bar{Y}\) | Sample means of the observed \(X_i\) and \(Y_i\) |
| \(s^2\) | Sample variance |

Before fitting models, let’s recall some foundational ideas:

  • A random variable \(Y\) has a probability distribution.
  • The expected value (mean) is
    \[ \mathbb{E}[Y] = \mu = \int y\,dF_Y(y) \]
  • The variance measures spread:
    \[ \mathrm{Var}(Y) = \mathbb{E}\bigl[(Y - \mu)^2\bigr] = \sigma^2. \]
  • The sample mean of \(y_1,\dots,y_n\): \[ \bar y = \frac1n\sum_{i=1}^n y_i. \]
  • The sample variance: \[ s^2 = \frac{1}{n-1}\sum_{i=1}^n (y_i - \bar y)^2. \]
  • Covariance of two random variables \(X\) and \(Y\): \[ \mathrm{Cov}(X,Y) = \mathbb{E}\bigl[(X - \mathbb{E}[X])(Y - \mathbb{E}[Y])\bigr]. \]
  • Correlation (Pearson’s \(\rho\)) is the standardized covariance: \[ \rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}. \]
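The sample quantities above translate directly into code. Below is a minimal numpy sketch on a small made-up sample (not the housing data), computing the sample mean, variance, covariance, and Pearson correlation exactly as defined:

```python
import numpy as np

# Small illustrative sample (hypothetical data, not from the housing set)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

n = len(x)

# Sample means: (1/n) * sum of the observations
x_bar, y_bar = x.mean(), y.mean()

# Sample variance with the n-1 correction, matching s^2 above
s2_x = ((x - x_bar) ** 2).sum() / (n - 1)   # same as np.var(x, ddof=1)
s2_y = ((y - y_bar) ** 2).sum() / (n - 1)

# Sample covariance and Pearson correlation
cov_xy = ((x - x_bar) * (y - y_bar)).sum() / (n - 1)
rho = cov_xy / np.sqrt(s2_x * s2_y)

print(x_bar, s2_x, cov_xy, rho)  # 3.0 2.5 5.0 1.0
```

Since \(y = 2x\) here, the correlation is exactly 1, which is a handy sanity check; note `ddof=1` in numpy selects the \(n-1\) denominator used for the sample variance.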

We will see how these quantities enter into formulas for parameter estimates, tests, and confidence intervals.

To ground our discussion, we’ll work throughout the course with a real-world dataset of house sales. Our goal:

Predict the sale price \(y\) of a home from features such as:

  • Size (\(\text{sqft}\))
  • Number of bedrooms
  • Age of the home
  • Neighborhood quality (which will be encoded as dummy variables)

In notation:

\[ y_i \;=\; a_0 + a_1\,\text{sqft}_i + a_2\,\text{beds}_i + a_3\,\text{age}_i + \cdots + \varepsilon_i. \]

Here is a sample of 10 rows:

| Size (sqft) | Bedrooms | Age (years) | Neighborhood | Price ($) |
|---|---|---|---|---|
| 1710 | 3 | 5 | CollgCr | 208500 |
| 1262 | 3 | 31 | Veenker | 181500 |
| 1786 | 3 | 7 | CollgCr | 223500 |
| 1717 | 3 | 91 | Crawfor | 140000 |
| 2198 | 4 | 8 | NoRidge | 250000 |
| 1362 | 1 | 16 | Mitchel | 143000 |
| 1694 | 3 | 3 | Somerst | 307000 |
| 2090 | 3 | 36 | NWAmes | 200000 |
| 1774 | 2 | 77 | OldTown | 129900 |
| 1077 | 2 | 69 | BrkSide | 118000 |
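As a preview of the estimation machinery covered in later modules, the model written above can already be fit to these 10 rows by ordinary least squares. This is a minimal numpy sketch: it builds the design matrix \(X\) with a leading column of ones for the intercept \(a_0\), and drops the neighborhood column since dummy coding comes later.

```python
import numpy as np

# The 10 sample rows above: size (sqft), bedrooms, age, sale price
data = np.array([
    [1710, 3,  5, 208500],
    [1262, 3, 31, 181500],
    [1786, 3,  7, 223500],
    [1717, 3, 91, 140000],
    [2198, 4,  8, 250000],
    [1362, 1, 16, 143000],
    [1694, 3,  3, 307000],
    [2090, 3, 36, 200000],
    [1774, 2, 77, 129900],
    [1077, 2, 69, 118000],
], dtype=float)

features, price = data[:, :3], data[:, 3]

# Design matrix X: a leading column of ones carries the intercept a_0
X = np.column_stack([np.ones(len(price)), features])

# Ordinary least squares: coefficients minimizing ||y - X a||^2
a_hat, *_ = np.linalg.lstsq(X, price, rcond=None)

print(np.round(a_hat, 2))  # [a_0, a_1 (sqft), a_2 (beds), a_3 (age)]
```

The coefficients here are illustrative only: 10 observations are far too few for reliable estimates, which is exactly why the course covers standard errors and hypothesis tests for these quantities.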

(Figure: scatter plot of Price \(y\) vs. Size \(x\).)

Later modules will show how to:

  • Estimate the coefficients \(a\) (Ordinary Least Squares)
  • Test hypotheses about feature effects (t-tests, F-tests)
  • Check model assumptions (residual diagnostics)
  • Extend to generalized linear models (e.g., logistic regression)

With our notation and basic statistics clear, we’re ready to dive into Simple Linear Regression in the next lesson!