Regularization, Batch Normalization, and Multiclass Outputs

When a deep network overfits, the usual first response is to reduce variance through regularization rather than to immediately shrink the model.

The main tools highlighted in this lecture are:

  • L1 and L2 regularization,
  • dropout,
  • better initialization,
  • data augmentation,
  • early stopping,
  • batch normalization.

Regularization penalizes large weights in the objective:

  • L2 encourages smaller, smoother weights
  • L1 can push some weights toward sparsity

A common L2-regularized objective is

\[ J_{\mathrm{reg}}(\theta) = J(\theta) + \frac{\lambda}{2m}\sum_{l=1}^{L}\|W^{[l]}\|_F^2. \]

In practice, L2 is the more common default in deep-learning training pipelines. The regularization strength \(\lambda\) is a hyperparameter and should be tuned on validation data.
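As a minimal sketch of the penalty above, assuming NumPy and that `weights` is a list of the layer matrices \(W^{[l]}\) (the function names here are illustrative, not from any particular framework):

```python
import numpy as np

def l2_regularized_cost(cost, weights, lam, m):
    """Add the L2 penalty (lam / 2m) * sum of squared Frobenius norms."""
    penalty = (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)
    return cost + penalty

def l2_gradient_term(W, lam, m):
    """Extra gradient contribution of the penalty for one layer: (lam / m) * W."""
    return (lam / m) * W
```

During backpropagation, the penalty simply adds \((\lambda/m)\,W^{[l]}\) to each layer's weight gradient, which is why L2 is often called "weight decay."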

Dropout randomly deactivates neurons during training. This prevents the network from relying too heavily on one specific pathway.

Key points:

  • apply dropout only during training,
  • disable the random masking at test time (the full network is used for prediction),
  • rescale the kept activations during training so that their expected values match between training and inference.

With inverted dropout, we often write

\[ \tilde A^{[l]} = \frac{D^{[l]} \odot A^{[l]}}{\text{keep\_prob}}, \]

where \(D^{[l]}\) is a random mask of zeros and ones.
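The masking-and-rescaling step above can be sketched as follows (assuming NumPy; `inverted_dropout` is an illustrative name):

```python
import numpy as np

def inverted_dropout(A, keep_prob, rng):
    """Apply inverted dropout to activations A during training.

    Returns the masked-and-rescaled activations and the mask D,
    which is reused when backpropagating through this layer.
    """
    D = (rng.random(A.shape) < keep_prob).astype(A.dtype)  # 0/1 mask
    return (D * A) / keep_prob, D
```

Dividing by `keep_prob` is what makes the dropout "inverted": the expected activation is unchanged, so no rescaling is needed at test time.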

Dropout is powerful, but it can also make optimization noisier if the rate is too aggressive.

Batch normalization stabilizes intermediate activations by normalizing them inside mini-batches, then learning a scale and shift afterward.

Its main benefits are:

  • smoother optimization,
  • faster convergence,
  • some regularization effect,
  • better control of exploding or vanishing behavior.

For one batch, the core normalization step is

\[ \hat z = \frac{z - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad y = \gamma \hat z + \beta. \]
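A sketch of this forward step, assuming NumPy with examples along the batch axis (`axis=0`); running statistics for inference are omitted for brevity:

```python
import numpy as np

def batchnorm_forward(Z, gamma, beta, eps=1e-5):
    """Normalize Z over the mini-batch, then apply a learned scale and shift."""
    mu = Z.mean(axis=0)                  # per-feature batch mean
    var = Z.var(axis=0)                  # per-feature batch variance
    Z_hat = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_hat + beta
```

At inference time, frameworks replace the batch statistics with running averages accumulated during training, so a single example can be normalized consistently.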

When the output has more than two classes, a single sigmoid output is no longer enough. We usually use softmax, which converts logits into a probability distribution over classes:

\[ \mathrm{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}. \]

This is typically paired with cross-entropy loss for multi-class learning.
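A sketch of the pair, assuming NumPy; subtracting the max logit before exponentiating is the standard trick to avoid overflow:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax along the last axis."""
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(probs, y):
    """Mean negative log-probability of the true class indices y."""
    return -np.mean(np.log(probs[np.arange(len(y)), y]))
```

Shifting the logits by their max leaves the output unchanged (the shift cancels in the ratio) but keeps `np.exp` from overflowing on large logits.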

In this lesson we covered:

  1. The main regularization tools used to reduce variance in deep learning
  2. The roles of L1 and L2 penalties
  3. Why dropout improves robustness
  4. How batch normalization stabilizes training
  5. Why softmax is the natural output layer for multi-class classification

Next: We now move to convolutional neural networks, the architecture family that transformed modern computer vision.