CNN Fundamentals
Why Fully Connected Nets Struggle on Images
Image data is large and structured. A single \(1280 \times 720\) image contains nearly a million pixels (about 2.8 million values once the three RGB channels are counted), and nearby pixels are far more correlated than distant ones.
A fully connected network treats every input position as unrelated, which leads to:
- an enormous number of parameters,
- high computational cost,
- and a strong tendency to overfit.
CNNs solve this by exploiting spatial structure directly.
Two Core CNN Ideas
The lecture emphasizes two big ideas:
- Sparse connectivity: a neuron only looks at a local receptive field.
- Parameter sharing: the same filter is reused across many spatial positions.
Together, these choices drastically reduce parameters while preserving the ability to detect useful local patterns such as edges and textures.
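To make the parameter savings concrete, here is a back-of-envelope count. The layer sizes below (a 1000-unit hidden layer, a single \(3 \times 3\) filter) are illustrative assumptions, not values from the lecture:

```python
# Hypothetical sizes for illustration: a 1280x720 RGB input,
# a 1000-unit fully connected layer, and one 3x3 conv filter.
h, w, c = 720, 1280, 3
hidden_units = 1000

# Fully connected: every input value connects to every hidden unit.
fc_params = h * w * c * hidden_units          # weights only
print(f"fully connected: {fc_params:,}")      # about 2.8 billion

# Convolutional: one 3x3 filter spanning all 3 input channels,
# shared across every spatial position (parameter sharing).
conv_params = 3 * 3 * c
print(f"one shared 3x3 filter: {conv_params}")  # 27
```

Even with many filters, the convolutional layer stays several orders of magnitude smaller than the fully connected one.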
If \(X\) is an input image and \(K\) is a filter, a convolutional response can be written schematically as
\[ (X * K)(i,j) = \sum_u \sum_v X(i+u, j+v)\,K(u,v). \]
Filters and Feature Detection
A filter or kernel slides across the image and responds strongly when a certain pattern appears.
Early filters often resemble edge detectors. Deeper filters become more abstract because they operate on feature maps created by earlier layers.
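The schematic formula \((X * K)(i,j) = \sum_u \sum_v X(i+u, j+v)\,K(u,v)\) can be sketched directly in NumPy. This is a minimal single-channel version (no stride, no padding); the example filter is a simple vertical-edge detector chosen for illustration:

```python
import numpy as np

def conv2d(X, K):
    """Valid cross-correlation of image X with filter K,
    matching (X * K)(i, j) = sum_u sum_v X(i+u, j+v) K(u, v)."""
    H, W = X.shape
    f_h, f_w = K.shape
    out = np.zeros((H - f_h + 1, W - f_w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Elementwise product of the local patch with the filter.
            out[i, j] = np.sum(X[i:i + f_h, j:j + f_w] * K)
    return out

# A vertical-edge filter responds strongly where intensity
# changes from left to right.
X = np.array([[0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
K = np.array([[-1, 1],
              [-1, 1]], dtype=float)
print(conv2d(X, K))   # strong response (2.0) only at the edge column
```

Note that, like most DL frameworks, this computes cross-correlation; deep learning convention still calls it convolution.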
Stride, Padding, and Output Size
Three hyperparameters shape the behavior of a convolutional layer:
- Filter size: a larger filter looks at a wider local context at each position.
- Stride: a larger stride moves the filter farther each step, shrinking the output.
- Padding: adding values (usually zeros) around the border preserves edge information and gives control over the output size.
The output dimensions depend on all three, together with the input size.
For one spatial dimension, the standard output-size formula is
\[ n_{\mathrm{out}} = \left\lfloor \frac{n_{\mathrm{in}} + 2p - f}{s} \right\rfloor + 1, \]
where \(f\) is the filter size, \(s\) is the stride, and \(p\) is the padding.
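The formula translates directly into a one-line helper. The example sizes (a \(32\)-wide input, a \(3 \times 3\) filter) are illustrative assumptions:

```python
def conv_output_size(n_in, f, s=1, p=0):
    """Output size along one spatial dimension:
    floor((n_in + 2p - f) / s) + 1."""
    return (n_in + 2 * p - f) // s + 1

# "Same" convolution: 3x3 filter, stride 1, padding 1 keeps 32 -> 32.
print(conv_output_size(32, f=3, s=1, p=1))   # 32
# Stride 2 roughly halves the spatial size: 32 -> 16.
print(conv_output_size(32, f=3, s=2, p=1))   # 16
```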
Convolution Over Volume
Real images usually have multiple channels, such as RGB. A convolutional filter therefore spans not only height and width, but also the full input depth: a \(3 \times 3\) filter applied to an RGB image actually has shape \(3 \times 3 \times 3\).
If we apply many filters, we produce many output feature maps, which increases the depth of the next representation.
So a convolution layer changes not only spatial dimensions, but also the number of channels.
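The shape bookkeeping can be sketched with a naive NumPy loop. The sizes here (a \(32 \times 32\) RGB input, 8 filters of size \(3 \times 3\)) are illustrative assumptions:

```python
import numpy as np

# Hypothetical shapes: a 32x32 RGB input and 8 filters of size 3x3.
H, W, C_in, C_out, f = 32, 32, 3, 8, 3

rng = np.random.default_rng(0)
X = rng.random((H, W, C_in))           # input volume
K = rng.random((C_out, f, f, C_in))    # one 3x3x3 filter per output channel

out = np.zeros((H - f + 1, W - f + 1, C_out))
for c in range(C_out):
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each filter spans the full input depth and
            # produces a single number per position.
            out[i, j, c] = np.sum(X[i:i + f, j:j + f, :] * K[c])

print(out.shape)   # (30, 30, 8): spatial size shrank, depth grew 3 -> 8
```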
Pooling
Pooling layers reduce spatial size while keeping the most important information.
Max pooling, which keeps only the largest value in each local window, is the most common example:
- it lowers computation,
- reduces sensitivity to small shifts,
- and helps control overfitting.
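Max pooling can be sketched as follows; this minimal version assumes non-overlapping \(2 \times 2\) windows on a single feature map:

```python
import numpy as np

def max_pool2d(X, size=2, stride=2):
    """Max pooling over a single 2D feature map."""
    H, W = X.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Keep only the strongest response in each window.
            out[i, j] = X[i * stride:i * stride + size,
                          j * stride:j * stride + size].max()
    return out

X = np.array([[1, 3, 2, 0],
              [4, 2, 1, 1],
              [0, 1, 5, 6],
              [2, 2, 7, 3]], dtype=float)
print(max_pool2d(X))
# [[4. 2.]
#  [2. 7.]]
```

Shifting a feature by one pixel inside a window often leaves the pooled output unchanged, which is where the shift robustness comes from.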
Summary
In this lesson we covered:
- Why fully connected networks are inefficient on images
- Sparse connectivity and parameter sharing
- Filters as learned local pattern detectors
- The roles of stride and padding
- Convolution across RGB volumes
- Why pooling helps efficiency and robustness
Next: We finish this first DL block with two classic CNN architectures: LeNet and AlexNet.