A fully connected network treats every input feature as equally related to every other. That is wasteful and often wrong. Real data has structure — spatial structure in images, sequential structure in language, hierarchical structure in graphs. Specialized architectures encode these structural assumptions directly, allowing the network to learn more from less data.
A Convolutional Neural Network (CNN) applies small filters (kernels) that slide across the input, detecting local patterns regardless of position. A 3x3 filter that detects vertical edges works the same whether the edge is in the top-left or bottom-right of the image. Strictly speaking, this property is translation equivariance: shift the input and the feature map shifts with it. Combined with pooling, it yields approximate translation invariance, the architecture’s key inductive bias.
Convolution: a small filter slides across the input, computing a dot product at each position. The output is a feature map highlighting where the pattern occurs. Multiple filters detect multiple patterns. The first layer might learn 64 different edge detectors.
Pooling: downsamples feature maps, reducing spatial dimensions while preserving the detected patterns. Max pooling keeps the strongest activation in each region. This provides slight translation invariance and reduces computation.
Stacking: convolution layers build on each other. Early layers detect edges. Middle layers detect combinations of edges (textures, corners). Deeper layers respond to object parts and whole objects. The hierarchy is not hand-designed; it emerges from training.
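The convolution and pooling steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how a framework implements them: the filter is hand-written rather than learned, the image is a synthetic 8x8 array, and the function names `conv2d` and `max_pool2d` are chosen for this example only.

```python
import numpy as np

def conv2d(image, kernel):
    """Stride-1, no-padding sliding dot product (what deep learning calls convolution)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Dot product between the filter and the image patch under it.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(fmap, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    trimmed = fmap[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

# Hand-written vertical-edge filter (assumption: a trained CNN would learn its own).
vertical_edge = np.array([[-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0],
                          [-1.0, 0.0, 1.0]])

# Synthetic image: dark left half, bright right half.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

feature_map = conv2d(image, vertical_edge)  # shape (6, 6), peaks at the edge
pooled = max_pool2d(feature_map)            # shape (3, 3), edge still visible

# Move the edge one pixel to the right: the response moves one column too.
shifted = np.zeros((8, 8))
shifted[:, 5:] = 1.0
shifted_map = conv2d(shifted, vertical_edge)
```

Each row of `feature_map` is nonzero only in the two columns that straddle the edge, and `shifted_map` is the same response moved one column over, which is the position-independence the filter provides.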
In 2012, AlexNet, a CNN, won the ImageNet image classification competition with a margin of roughly 10 percentage points in top-5 error over the best traditional method (15.3% versus 26.2%). This was the result that convinced the field that deep learning worked at scale. AlexNet used ReLU activations, dropout regularization, and GPU training, all techniques that are now standard. Within three years, CNNs surpassed estimated human-level accuracy on ImageNet. The architecture itself was not new (LeCun’s 1998 LeNet worked on similar principles), but scale (more data, more compute, more depth) made the difference.
A Recurrent Neural Network processes sequences one element at a time, maintaining a hidden state that carries information from previous steps. At each timestep, the network takes the current input and the previous hidden state, producing an output and an updated hidden state. This gives the network memory — it can, in principle, use information from arbitrarily far in the past.
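The per-timestep update described above can be sketched as follows. This is a minimal forward pass under assumed choices: a tanh nonlinearity, small random weights standing in for trained ones, and illustrative names (`W_xh`, `W_hh`, `W_hy`, `rnn_step`) that do not come from any particular library.

```python
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size, output_size = 4, 8, 3

# Illustrative parameters; random values stand in for trained weights.
W_xh = rng.normal(scale=0.1, size=(input_size, hidden_size))
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
W_hy = rng.normal(scale=0.1, size=(hidden_size, output_size))
b_h = np.zeros(hidden_size)
b_y = np.zeros(output_size)

def rnn_step(x_t, h_prev):
    """Combine the current input with the previous hidden state."""
    h_t = np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)  # updated hidden state
    y_t = h_t @ W_hy + b_y                           # output at this timestep
    return y_t, h_t

sequence = rng.normal(size=(5, input_size))  # 5 timesteps of 4 features each
h = np.zeros(hidden_size)                    # initial hidden state
outputs = []
for x_t in sequence:
    # Each step needs the hidden state from the step before it.
    y_t, h = rnn_step(x_t, h)
    outputs.append(y_t)
outputs = np.stack(outputs)                  # shape (5, 3)
```

The same weights are reused at every timestep; only the hidden state `h` changes as the sequence is consumed.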
In practice, vanilla RNNs struggle with long sequences. The vanishing gradient problem hits hard: gradients shrink exponentially as they propagate backward through many timesteps, making it nearly impossible to learn dependencies that span more than roughly 10-20 steps.
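A small numerical experiment makes the shrinkage concrete. The sketch below assumes a tanh RNN whose recurrent matrix has been rescaled to a largest singular value of 0.5 (an arbitrary value picked for illustration) and tracks the norm of the Jacobian of the current hidden state with respect to the initial one.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 16, 50

# Recurrent weights rescaled so the largest singular value is 0.5 (assumed value).
W_hh = rng.normal(size=(hidden_size, hidden_size))
W_hh *= 0.5 / np.linalg.norm(W_hh, 2)

h = np.zeros(hidden_size)
jacobian = np.eye(hidden_size)  # accumulates d h_t / d h_0
norms = []
for t in range(steps):
    x_t = rng.normal(size=hidden_size)  # input, assumed already projected to hidden size
    h = np.tanh(W_hh @ h + x_t)
    # One-step Jacobian: diag(1 - h_t^2) @ W_hh, since tanh'(z) = 1 - tanh(z)^2.
    step_jacobian = (1.0 - h ** 2)[:, None] * W_hh
    jacobian = step_jacobian @ jacobian
    norms.append(np.linalg.norm(jacobian, 2))

print(f"after 1 step:   {norms[0]:.3e}")
print(f"after 10 steps: {norms[9]:.3e}")
print(f"after 50 steps: {norms[-1]:.3e}")
```

Because the tanh derivative is at most 1, every step multiplies the gradient by a factor of at most 0.5 here, so the signal reaching step 0 from step 50 is bounded by 0.5 to the 50th power. With larger recurrent weights the same product can instead grow without bound, which is the exploding-gradient counterpart.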
LSTMs (Long Short-Term Memory): mitigate the vanishing gradient problem with a gating mechanism. Three gates (forget, input, output) control what information the cell state retains, adds, and exposes. The cell state acts as a conveyor belt: because it is updated additively rather than rewritten through a nonlinearity at every step, it carries information across many timesteps with minimal gradient degradation.
GRUs (Gated Recurrent Units): a simplified variant with two gates instead of three. Often performs comparably to LSTMs with fewer parameters. The choice between them is usually empirical.
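One LSTM step can be sketched as below, following the commonly used formulation with a separate weight matrix per gate acting on the concatenated previous hidden state and current input. The weights are random placeholders and the names (`init_gate`, `lstm_step`) are illustrative assumptions, not a library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 6

def init_gate():
    """Random weights and zero bias for one gate (placeholders for trained values)."""
    W = rng.normal(scale=0.1, size=(hidden_size + input_size, hidden_size))
    return W, np.zeros(hidden_size)

(W_f, b_f), (W_i, b_i), (W_o, b_o), (W_c, b_c) = (init_gate() for _ in range(4))

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(z @ W_f + b_f)        # forget gate: what the cell state retains
    i = sigmoid(z @ W_i + b_i)        # input gate: what gets added
    o = sigmoid(z @ W_o + b_o)        # output gate: what gets exposed
    c_tilde = np.tanh(z @ W_c + b_c)  # candidate values to add
    c_t = f * c_prev + i * c_tilde    # additive update: the "conveyor belt"
    h_t = o * np.tanh(c_t)            # exposed hidden state
    return h_t, c_t

h = np.zeros(hidden_size)
c = np.zeros(hidden_size)
for x_t in rng.normal(size=(5, input_size)):
    h, c = lstm_step(x_t, h, c)
```

The line computing `c_t` is the part that matters for gradients: when the forget gate is close to 1, the previous cell state passes through almost unchanged.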
RNNs process tokens one at a time, in order. This means training cannot be parallelized across timesteps — each step depends on the previous one. For long sequences, this is prohibitively slow. A 1,000-token sequence requires 1,000 sequential operations regardless of available hardware. This sequential bottleneck is the fundamental limitation that the transformer architecture was designed to overcome.
An autoencoder is trained to reconstruct its input. That sounds trivial — but the trick is forcing the data through a bottleneck. The encoder compresses the input into a lower-dimensional representation (the latent space). The decoder reconstructs the original from this compressed form. If reconstruction is accurate, the bottleneck representation must capture the essential information.
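The bottleneck idea can be sketched with a tiny linear autoencoder trained by plain gradient descent. Everything here is an assumption made for illustration: the data is synthetic and constructed to lie on a 2-dimensional subspace of an 8-dimensional space, the encoder and decoder are single weight matrices with no nonlinearity, and the learning rate and step count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, input_dim, latent_dim = 200, 8, 2

# Synthetic data that truly lies on a 2-D subspace of the 8-D input space.
basis, _ = np.linalg.qr(rng.normal(size=(input_dim, latent_dim)))  # orthonormal columns
X = rng.normal(size=(n_samples, latent_dim)) @ basis.T             # shape (200, 8)

W_enc = rng.normal(scale=0.1, size=(input_dim, latent_dim))  # encoder: 8 -> 2
W_dec = rng.normal(scale=0.1, size=(latent_dim, input_dim))  # decoder: 2 -> 8

def reconstruction_loss(X, W_enc, W_dec):
    """Mean squared reconstruction error per sample."""
    return np.mean(np.sum((X @ W_enc @ W_dec - X) ** 2, axis=1))

initial_loss = reconstruction_loss(X, W_enc, W_dec)

lr = 0.05
for _ in range(500):
    Z = X @ W_enc                                      # encode into the bottleneck
    err = Z @ W_dec - X                                # decode and compare to the input
    grad_dec = 2.0 * Z.T @ err / n_samples
    grad_enc = 2.0 * X.T @ (err @ W_dec.T) / n_samples
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

final_loss = reconstruction_loss(X, W_enc, W_dec)
latent = X @ W_enc                                     # compressed representation, (200, 2)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.6f}")
```

No labels are used anywhere: the only training signal is how well the 2-number code lets the decoder rebuild the 8-number input.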
This is representation learning — the network discovers a compact, useful encoding of the data without supervision. The latent space organizes data by similarity: similar inputs map to nearby points.
Variational Autoencoders (VAEs): constrain the latent space to follow a known distribution (typically Gaussian). This makes the latent space smooth and continuous, enabling generation — sample a point from the latent space and decode it to produce a new, plausible data point. VAEs are generative models: they learn the data distribution, not just a compression.
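In the standard VAE formulation, the constraint on the latent space is a KL-divergence penalty between the encoder's Gaussian output and a standard normal prior, and generation is a sample from that prior passed through the decoder. The sketch below shows both pieces; the linear decoder `W_dec` is a hypothetical stand-in for a trained network, and the encoder outputs are made-up values.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
latent_dim, output_dim = 2, 8

# Hypothetical linear decoder; a real VAE would use a trained network here.
W_dec = rng.normal(size=(latent_dim, output_dim))

def decode(z):
    return z @ W_dec

# An encoder output that already matches the prior incurs no penalty...
kl_match = kl_to_standard_normal(np.zeros(latent_dim), np.zeros(latent_dim))
# ...while one centered far from the origin is penalized.
kl_far = kl_to_standard_normal(np.array([3.0, -3.0]), np.zeros(latent_dim))

# Generation: sample a latent point from the prior and decode it.
z = rng.standard_normal(latent_dim)
new_sample = decode(z)  # shape (8,)
```

The penalty pulls every encoded input toward the same region around the origin, which is why points sampled from the prior land somewhere the decoder has learned to handle.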
Autoencoders automate what feature engineering does manually. Instead of a domain expert deciding which features to extract, the network learns which features matter for faithful reconstruction. This idea of learning useful representations from raw data is arguably the central contribution of deep learning. Every modern model, from BERT to DALL-E, is fundamentally a representation learning system.
A Generative Adversarial Network pits two networks against each other. The generator creates fake data (images, text, audio). The discriminator tries to distinguish real data from fake. Both improve through competition: the generator learns to produce more realistic fakes; the discriminator learns to catch subtler forgeries.
When training converges (which is notoriously difficult), the generator produces data indistinguishable from real samples. GANs produced the first photorealistic AI-generated faces and drove early image synthesis research.
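The two competing objectives can be written down compactly. The sketch below assumes the common binary cross-entropy formulation with the non-saturating generator loss, and it skips the networks entirely: the discriminator outputs are made-up probabilities chosen to represent an early and a late stage of training.

```python
import numpy as np

def bce(pred, target):
    """Binary cross-entropy, with clipping to avoid log(0)."""
    pred = np.clip(pred, 1e-12, 1.0 - 1e-12)
    return -np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))

def discriminator_loss(d_real, d_fake):
    # The discriminator wants real samples scored 1 and fakes scored 0.
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))

def generator_loss(d_fake):
    # Non-saturating form: the generator wants its fakes scored as real.
    return bce(d_fake, np.ones_like(d_fake))

# Hypothetical discriminator outputs (probability that a sample is real).
early_real, early_fake = np.array([0.90, 0.95]), np.array([0.05, 0.10])  # fakes are obvious
late_real, late_fake = np.array([0.5, 0.5]), np.array([0.5, 0.5])        # fakes are convincing

d_early = discriminator_loss(early_real, early_fake)
g_early = generator_loss(early_fake)
d_late = discriminator_loss(late_real, late_fake)
g_late = generator_loss(late_fake)
print(f"early: D={d_early:.3f}, G={g_early:.3f}")
print(f"late:  D={d_late:.3f}, G={g_late:.3f}")
```

Early on the discriminator's loss is low and the generator's is high; at the idealized equilibrium the discriminator outputs 0.5 for everything and can do no better than guessing.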
Training instability: GANs are hard to train, and two failure modes are common. In mode collapse, the generator produces only a few types of output, ignoring the full data distribution. In oscillation, the generator and discriminator cycle without converging. These difficulties led to extensive research on training stabilization (WGAN, spectral normalization, progressive growing) and ultimately to the rise of diffusion models as a more stable alternative.
Diffusion models (Stable Diffusion, DALL-E 2, Midjourney) largely replaced GANs for image generation by 2023. Instead of adversarial training, diffusion models learn to reverse a gradual noise-adding process: start with pure noise and iteratively denoise it into an image. Training is stable (the objective is simply to predict the noise added at each step), the models cover the data distribution more fully than GANs prone to mode collapse, and quality scales reliably with compute. The lesson: architectural choices that make training stable and scalable tend to win, even if an alternative is theoretically more elegant.
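The noise-adding process and the training target can be sketched as follows. The sketch assumes a DDPM-style formulation with a linear noise schedule; the schedule endpoints, the number of steps, and the random array standing in for an image are all illustrative choices, and no denoising network is included.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)   # per-step noise variance (assumed linear schedule)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of the original signal left at step t

def add_noise(x0, t, noise):
    """Jump straight to step t of the forward (noising) process."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def noise_prediction_loss(predicted_noise, true_noise):
    """Training objective: mean squared error between predicted and actual noise."""
    return np.mean((predicted_noise - true_noise) ** 2)

x0 = rng.normal(size=(8, 8))           # stand-in for an image
noise = rng.standard_normal(x0.shape)

x_early = add_noise(x0, 10, noise)     # still almost entirely the original
x_late = add_noise(x0, T - 1, noise)   # almost entirely noise

# A network would be trained to recover `noise` from (x_t, t); a perfect one scores 0.
perfect = noise_prediction_loss(noise, noise)
```

The regression target is always a sample of Gaussian noise, so there is no second network to balance against, which is the source of the training stability noted above.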
Every architecture encodes assumptions about the structure of its data. CNNs assume spatial locality and translation invariance — appropriate for images, wrong for tabular data. RNNs assume sequential ordering — appropriate for time series, inefficient for data with long-range dependencies. Autoencoders assume the data lies on a lower-dimensional manifold. GANs assume an adversarial game can capture the data distribution.
These assumptions are inductive biases. The right inductive bias allows a model to learn from less data and generalize better. The wrong one forces the model to fight its architecture. Choosing the right architecture for the problem — or designing one with appropriate biases — is one of the most important decisions in deep learning.
The transformer architecture (covered in the next lessons) has remarkably few inductive biases compared to CNNs and RNNs. It does not assume spatial locality, sequential processing, or fixed-range dependencies. This makes it less efficient on small datasets (it needs more data to learn what CNNs and RNNs get for free from their structure) but more flexible and scalable. When data is abundant, minimal bias wins — and the internet provides abundant data for language. This is a key reason transformers dominate modern AI.