Deep Networks and Optimization

L-Layer Neural Networks

Deep learning earns its name from depth. Instead of stopping at one hidden layer, we repeat the same layer pattern several times.

For an \(L\)-layer network:

layers \(1\) to \(L-1\) usually use a hidden activation such as ReLU or tanh,
the final layer uses an output activation suited to the task.

This layered structure lets the network build features progressively, from simple transformations to more abstract ones.

We can summarize the whole network as

\[ A^{[L]} = f(x;\theta), \qquad \theta = \{W^{[1]}, b^{[1]}, \dots, W^{[L]}, b^{[L]}\}. \]
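As a minimal sketch of this summary, the forward pass below applies the same linear-plus-activation pattern once per layer. The ReLU hidden activation follows the text; the sigmoid output, the layer sizes, and the helper names (`relu`, `sigmoid`, `forward`) are assumptions chosen for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, params, output_activation=sigmoid):
    """Compute A^[L] = f(x; theta) for a fully connected network.

    x has shape (n_features, m). params is a list of (W, b) pairs,
    one per layer, with W of shape (n_out, n_in) and b of shape (n_out, 1).
    """
    a = x
    num_layers = len(params)
    for layer, (W, b) in enumerate(params, start=1):
        z = W @ a + b
        # Hidden layers 1..L-1 use ReLU; layer L uses the output activation.
        a = relu(z) if layer < num_layers else output_activation(z)
    return a

rng = np.random.default_rng(0)
layer_sizes = [4, 5, 3, 1]  # hypothetical: 4 inputs, two hidden layers, 1 output
params = [
    (rng.standard_normal((n_out, n_in)) * 0.1, np.zeros((n_out, 1)))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
x = rng.standard_normal((4, 8))  # 8 examples stored as columns
y_hat = forward(x, params)
print(y_hat.shape)
```

Here `params` plays the role of \(\theta\): adding a layer only means appending one more `(W, b)` pair.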

Initialization Matters

Training deep networks is sensitive to how weights are initialized.

If all weights start at zero:

neurons in the same layer learn the same function,
symmetry is never broken,
learning becomes ineffective.
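The symmetry problem can be checked numerically. The sketch below trains a tiny two-layer network whose weights all start at the same constant; the toy data, the tanh hidden layer, the sigmoid output, and the function name `train_constant_init` are assumptions made for this illustration only.

```python
import numpy as np

def train_constant_init(c, steps=50, lr=0.1, seed=0):
    """Train a 3-4-1 network whose weights all start at the constant c."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((3, 20))           # 3 features, 20 examples
    y = (X[0:1] + X[1:2] > 0).astype(float)    # toy binary labels, shape (1, 20)
    W1 = np.full((4, 3), float(c)); b1 = np.zeros((4, 1))
    W2 = np.full((1, 4), float(c)); b2 = np.zeros((1, 1))
    m = X.shape[1]
    for _ in range(steps):
        # Forward pass: tanh hidden layer, sigmoid output.
        A1 = np.tanh(W1 @ X + b1)
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))
        # Backward pass for the mean cross-entropy loss.
        dZ2 = (A2 - y) / m
        dW2 = dZ2 @ A1.T
        db2 = dZ2.sum(axis=1, keepdims=True)
        dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)
        dW1 = dZ1 @ X.T
        db1 = dZ1.sum(axis=1, keepdims=True)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return W1, W2

W1_zero, _ = train_constant_init(0.0)
W1_const, _ = train_constant_init(0.5)
# Each row of W1 holds one hidden neuron's weights; the rows stay identical.
print(np.allclose(W1_zero, W1_zero[0]), np.allclose(W1_const, W1_const[0]))
```

In this setup every hidden neuron receives the same gradient at every step, so the four neurons remain copies of one another regardless of how long training runs.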

If weights start too large:

activations can saturate,
gradients can vanish or explode,
optimization becomes unstable.

The lecture highlights variance-preserving initialization, in particular the Xavier-style idea of keeping the variance of activations and gradients roughly constant from one layer to the next.

For example, Xavier-style initialization uses a variance of roughly

\[ \mathrm{Var}(W) \approx \frac{2}{n_{\mathrm{in}} + n_{\mathrm{out}}}, \]

while He initialization for ReLU networks is often written as

\[ \mathrm{Var}(W) \approx \frac{2}{n_{\mathrm{in}}}. \]
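Both rules translate directly into a standard deviation for the random draw. The sketch below samples from a zero-mean Gaussian, which is one common choice (a uniform distribution with the same variance is another); the function names and layer sizes are hypothetical.

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # Var(W) = 2 / (n_in + n_out)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.standard_normal((n_out, n_in)) * std

def he_init(n_in, n_out, rng):
    # Var(W) = 2 / n_in, intended for ReLU layers
    std = np.sqrt(2.0 / n_in)
    return rng.standard_normal((n_out, n_in)) * std

rng = np.random.default_rng(0)
n_in, n_out = 400, 600  # hypothetical layer widths
W_xavier = xavier_init(n_in, n_out, rng)
W_he = he_init(n_in, n_out, rng)

# Empirical variance next to the target variance for each rule.
print(W_xavier.var(), 2.0 / (n_in + n_out))
print(W_he.var(), 2.0 / n_in)
```

The empirical variances should land close to the targets because each matrix contains 240,000 samples.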

SGD and Mini-Batches

Instead of always computing gradients on the full dataset, we often use smaller subsets.

Batch gradient descent: stable but expensive
Stochastic gradient descent (SGD): updates after one example, fast but noisy
Mini-batch gradient descent: practical compromise used most often

An epoch is one full pass through the training data. The batch size controls how many examples are used per update.

For a mini-batch \(B\), the parameter update takes the familiar form

\[ \theta \leftarrow \theta - \alpha \nabla_\theta J_B(\theta). \]
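The update rule and the roles of epoch and batch size can be sketched on a small linear-regression problem. The mean-squared-error loss, the synthetic data, and the function name `minibatch_gd` are assumptions for illustration, not part of the lecture.

```python
import numpy as np

def minibatch_gd(X, y, batch_size, epochs, alpha, seed=0):
    """Mini-batch gradient descent on the mean squared error of a linear model."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(epochs):                      # one epoch = one pass over the data
        order = rng.permutation(m)               # reshuffle every epoch
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            # Gradient of J_B(theta) = mean((Xb @ theta - yb)^2) over the batch.
            grad = 2.0 * Xb.T @ (Xb @ theta - yb) / len(idx)
            theta -= alpha * grad                # theta <- theta - alpha * grad
    return theta

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
true_theta = np.array([1.5, -2.0, 0.5])
y = X @ true_theta + 0.01 * rng.standard_normal(200)

theta_hat = minibatch_gd(X, y, batch_size=32, epochs=50, alpha=0.05)
print(theta_hat)
```

Setting `batch_size=1` turns the same loop into stochastic gradient descent, and `batch_size=len(X)` turns it into batch gradient descent.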

Modern Optimizers

The lecture also reviews several optimizer families:

Momentum: smooths updates using an exponentially weighted average of past gradients
AdaGrad: adapts the learning rate per parameter, often useful for sparse settings
RMSprop: counters AdaGrad's ever-shrinking step size by replacing the running sum of squared gradients with an exponential moving average
Adam: combines momentum and adaptive scaling, which is why it is so widely used

In practice, Adam is a strong default because it usually works with little learning-rate tuning, but SGD with momentum is still common, since a well-tuned run often generalizes at least as well.
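The adaptive methods differ mainly in how they accumulate squared gradients. The sketch below writes one update step for each in their commonly published form; the function names, the `state` dictionary, the default hyperparameters, and the quadratic test problem are assumptions for illustration.

```python
import numpy as np

def adagrad_step(theta, grad, state, alpha=0.1, eps=1e-8):
    # Running SUM of squared gradients: the effective step only ever shrinks.
    state["G"] = state.get("G", 0.0) + grad ** 2
    return theta - alpha * grad / (np.sqrt(state["G"]) + eps)

def rmsprop_step(theta, grad, state, alpha=0.01, rho=0.9, eps=1e-8):
    # Exponential moving AVERAGE of squared gradients: old gradients fade out.
    state["s"] = rho * state.get("s", 0.0) + (1.0 - rho) * grad ** 2
    return theta - alpha * grad / (np.sqrt(state["s"]) + eps)

def adam_step(theta, grad, state, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    # First moment (momentum-like) and second moment (RMSprop-like scaling).
    state["m"] = beta1 * state.get("m", 0.0) + (1.0 - beta1) * grad
    state["v"] = beta2 * state.get("v", 0.0) + (1.0 - beta2) * grad ** 2
    m_hat = state["m"] / (1.0 - beta1 ** t)      # bias correction
    v_hat = state["v"] / (1.0 - beta2 ** t)
    return theta - alpha * m_hat / (np.sqrt(v_hat) + eps)

# Toy problem: J(theta) = 0.5 * sum(c * theta^2), with very different curvatures.
c = np.array([1.0, 100.0])
loss = lambda theta: 0.5 * np.sum(c * theta ** 2)
grad_fn = lambda theta: c * theta

def minimize(step_fn, steps=2000):
    theta, state = np.array([1.0, 1.0]), {}
    for _ in range(steps):
        theta = step_fn(theta, grad_fn(theta), state)
    return theta

final_losses = {fn.__name__: loss(minimize(fn))
                for fn in (adagrad_step, rmsprop_step, adam_step)}
print(final_losses)
```

Because each method divides by a per-parameter scale, the steep and the shallow direction of this toy loss receive comparably sized steps.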

Momentum itself can be summarized as

\[ v_t = \beta v_{t-1} + (1-\beta)\nabla_\theta J(\theta_t), \qquad \theta_{t+1} = \theta_t - \alpha v_t. \]
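The two equations map onto two lines of code. The sketch below applies them to a simple quadratic loss; the loss, the hyperparameter values, and the name `momentum_gd` are assumptions for illustration.

```python
import numpy as np

def momentum_gd(grad_fn, theta0, alpha=0.1, beta=0.9, steps=200):
    """Gradient descent with an exponentially weighted average of gradients."""
    theta = np.asarray(theta0, dtype=float).copy()
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + (1.0 - beta) * grad_fn(theta)   # v_t
        theta = theta - alpha * v                      # theta_{t+1}
    return theta

# Toy loss J(theta) = 0.5 * sum(curvature * theta^2), minimized at theta = 0.
curvature = np.array([1.0, 25.0])
grad_fn = lambda theta: curvature * theta

theta_final = momentum_gd(grad_fn, [1.0, 1.0])
print(theta_final)
```

With `beta=0` the loop reduces to plain gradient descent; larger values of `beta` average over a longer window of past gradients.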

Hyperparameters and Bias-Variance

Neural networks are highly sensitive to hyperparameters such as:

learning rate,
number of layers,
number of neurons per layer,
batch size,
number of epochs,
activation choice,
initialization method.

The same logic as in classical ML still applies:

high bias means the model is too weak or undertrained,
high variance means the model is too flexible or insufficiently regularized.
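One rough way to apply this logic is to compare training error, validation error, and a target error level. The helper below is a hypothetical heuristic: the function name `diagnose`, the `target_error` argument, and the tolerance `tol` are assumptions, and the numbers in the usage lines are made up for illustration.

```python
def diagnose(train_error, val_error, target_error, tol=0.02):
    """Heuristic bias/variance check; tol is an arbitrary illustrative threshold."""
    # High bias: the model cannot even fit the training data to the target level.
    high_bias = (train_error - target_error) > tol
    # High variance: the model fits training data much better than unseen data.
    high_variance = (val_error - train_error) > tol
    if high_bias and high_variance:
        return "high bias and high variance"
    if high_bias:
        return "high bias"
    if high_variance:
        return "high variance"
    return "balanced"

print(diagnose(train_error=0.15, val_error=0.16, target_error=0.02))
print(diagnose(train_error=0.01, val_error=0.12, target_error=0.01))
```

A "high bias" result points toward a larger model or longer training, while "high variance" points toward more data or stronger regularization.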

Summary

In this lesson we covered:

What an L-layer network means in practice
Why initialization is critical in deep models
The difference between batch, stochastic, and mini-batch gradient descent
The intuition behind Momentum, AdaGrad, RMSprop, and Adam
The main hyperparameters that shape training
Why bias and variance still guide model diagnosis in deep learning

Next: We will study the regularization tools that keep deep models stable and generalizable.
