A neural network is a function built from simple, repeated building blocks. Each block does almost nothing — multiply, add, squash. Stack enough of them in the right structure, and the composite function can approximate virtually any mapping from inputs to outputs.
A perceptron takes multiple inputs, multiplies each by a learned weight, sums the results, adds a bias term, and passes the total through an activation function. That is the entire computation: output = activation(w1*x1 + w2*x2 + … + wn*xn + b).
The weights determine how much each input matters. The bias shifts the decision boundary. The activation function introduces nonlinearity — without it, stacking layers would be pointless because a chain of linear transformations is still linear.
A single perceptron can learn only a linear decision boundary. It can solve AND and OR but not XOR, because no single straight line separates XOR's two classes. This limitation, analyzed by Minsky and Papert in their 1969 book Perceptrons, nearly killed the field. The solution was adding layers.
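The perceptron formula and its linear limit can be checked directly. The following is a minimal sketch in plain Python, assuming a step activation that fires when the weighted sum is positive. The names `perceptron`, `solves`, and `grid_search`, and the coarse weight grid, are illustrative choices rather than anything prescribed by the text.

```python
from itertools import product

def step(z):
    """Threshold activation: output 1 when the weighted sum is positive."""
    return 1 if z > 0 else 0

def perceptron(inputs, weights, bias, activation=step):
    """output = activation(w1*x1 + ... + wn*xn + b)"""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(total)

# Truth tables over the four possible binary input pairs.
INPUTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
AND = [0, 0, 0, 1]
OR = [0, 1, 1, 1]
XOR = [0, 1, 1, 0]

def solves(weights, bias, targets):
    """True if one perceptron with these parameters reproduces the truth table."""
    return all(perceptron(x, weights, bias) == t for x, t in zip(INPUTS, targets))

def grid_search(targets, values=(-2, -1.5, -1, -0.5, 0, 0.5, 1, 1.5, 2)):
    """Return the first (w1, w2, b) on a coarse grid that fits targets, else None."""
    for w1, w2, b in product(values, repeat=3):
        if solves((w1, w2), b, targets):
            return (w1, w2, b)
    return None

print(solves((1, 1), -1.5, AND))  # hand-picked weights for AND
print(solves((1, 1), -0.5, OR))   # hand-picked weights for OR
print(grid_search(XOR))           # no setting on the grid fits XOR
```

The grid search only illustrates the point. The real reason XOR fails is geometric (its classes are not linearly separable), so no finer grid would help.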
The perceptron was originally inspired by biological neurons: inputs (dendrites), weighted combination (cell body), threshold firing (axon). But modern neural networks have diverged far from neuroscience. The analogy is historical, not functional. Treating neural networks as brain simulations leads to incorrect intuitions. They are differentiable function approximators, nothing more.
Stack perceptrons in layers: an input layer, one or more hidden layers, and an output layer. Each neuron in one layer connects to every neuron in the next (a fully connected or dense layer). The output of one layer becomes the input to the next.
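To make the layered structure concrete, here is a small sketch of a dense layer and a two-layer network that computes XOR. The weights are hand-picked for illustration, not learned, and the helper names `dense` and `xor_network` are assumptions of this example.

```python
def step(z):
    """Threshold activation: output 1 when the weighted sum is positive."""
    return 1 if z > 0 else 0

def dense(inputs, weights, biases, activation):
    """Fully connected layer: one weight row and one bias per output neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def xor_network(x1, x2):
    # Hidden layer: the first neuron behaves like OR, the second like AND.
    hidden = dense([x1, x2], weights=[[1, 1], [1, 1]], biases=[-0.5, -1.5],
                   activation=step)
    # Output layer: fire when OR is on and AND is off.
    output = dense(hidden, weights=[[1, -1]], biases=[-0.5], activation=step)
    return output[0]

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, xor_network(a, b))
```

The output of the hidden layer is passed unchanged as the input to the output layer, which is all that "stacking" means here.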
Width (neurons per layer) determines the capacity of each layer — how many features it can represent simultaneously. Depth (number of layers) determines the level of abstraction — deeper layers learn combinations of combinations.
A network with one hidden layer of sufficient width can theoretically approximate any continuous function (the universal approximation theorem). But in practice, deeper networks learn more efficiently: for some classes of functions, a shallow network needs exponentially more parameters than a deep one to reach the same representational power.
Consider image recognition. The first layer learns edges — horizontal, vertical, diagonal. The second layer learns combinations of edges — corners, curves, textures. The third layer learns combinations of those — eyes, wheels, letters. The fourth learns combinations of those — faces, cars, words. Each layer builds increasingly abstract representations from the layer below. This hierarchical decomposition is why deep networks are so effective — they mirror the compositional structure of the real world.
Without activation functions, a neural network is just a linear transformation — useless for complex problems. Activation functions introduce the nonlinearity that makes deep learning work.
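The claim that stacked linear layers collapse into one can be verified with small integer matrices. This is a sketch with arbitrary example weights and with biases omitted for brevity (adding biases gives an affine map, which collapses the same way).

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

W1 = [[1, 2], [3, 4], [5, 6]]   # layer 1: 2 inputs -> 3 hidden units
W2 = [[1, 0, -1], [2, 1, 0]]    # layer 2: 3 hidden units -> 2 outputs
x = [1, -1]

# Two linear layers applied in sequence...
two_layers = matvec(W2, matvec(W1, x))
# ...equal one layer whose weight matrix is the product W2 @ W1.
collapsed = matvec(matmul(W2, W1), x)

# With a ReLU between the layers, the result no longer matches any
# single matrix applied to x in general.
with_relu = matvec(W2, [max(0, h) for h in matvec(W1, x)])

print(two_layers, collapsed, with_relu)
```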
ReLU (Rectified Linear Unit): output = max(0, x). If the input is positive, pass it through unchanged. If negative, output zero. Simple, fast to compute, and works remarkably well in practice. The default choice for hidden layers.
Sigmoid: output = 1 / (1 + e^(-x)). Squashes any input to the range (0, 1). Useful for binary classification outputs (interpret as probability). Problematic in deep networks because gradients vanish for large positive or large negative inputs, where the function is nearly flat.
Softmax: generalizes sigmoid to multiple classes. Takes a vector of raw scores and converts them to probabilities that sum to 1. Used in the output layer for classification: if the model must choose among 10 categories, softmax produces a probability distribution over all 10.
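The three activations above fit in a few lines each. The sketch below uses only the standard library. Subtracting the maximum score inside `softmax` and branching on the sign in `sigmoid` are common numerical-stability tricks added here, not something the definitions require.

```python
import math

def relu(x):
    """max(0, x): pass positive inputs through, zero out negative ones."""
    return max(0.0, x)

def sigmoid(x):
    """1 / (1 + e^(-x)), written to avoid overflow for large negative x."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    m = max(scores)  # shifting by the max does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))

# Softmax over two scores [x, 0] reduces to sigmoid(x) for the first class.
print(softmax([1.5, 0.0])[0], sigmoid(1.5))
```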
Sigmoid and tanh activations saturate — for large positive or negative inputs, the gradient approaches zero. During backpropagation, gradients are multiplied across layers. If each layer contributes a near-zero gradient, the product vanishes exponentially with depth. Early layers receive negligible updates and effectively stop learning.
ReLU mitigates this: for positive inputs, the gradient is always 1. But ReLU has its own problem — “dead neurons” that output zero for all inputs and never recover. Variants like Leaky ReLU and GELU address this by allowing small gradients for negative inputs.
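A rough way to see the effect of depth is to multiply the local activation gradients along a chain of layers. The sketch below makes a strong simplifying assumption: it ignores the weight matrices and looks only at the activation derivatives, so it shows the trend rather than the exact gradient of a real network. The function names are illustrative.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid; its maximum value is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0

def leaky_relu_grad(x, slope=0.01):
    """Derivative of Leaky ReLU: a small nonzero slope for negative inputs."""
    return 1.0 if x > 0 else slope

def chained_gradient(local_grad, pre_activations):
    """Product of per-layer activation derivatives (weights ignored)."""
    product = 1.0
    for z in pre_activations:
        product *= local_grad(z)
    return product

depth = 10
print(chained_gradient(sigmoid_grad, [0.0] * depth))  # best case for sigmoid
print(chained_gradient(sigmoid_grad, [4.0] * depth))  # saturated sigmoid
print(chained_gradient(relu_grad, [1.0] * depth))     # active ReLU units
print(relu_grad(-1.0), leaky_relu_grad(-1.0))         # dead vs. leaky unit
```

Even in the best case the sigmoid product shrinks by a factor of four per layer, while active ReLU units pass the gradient through unchanged.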
The loss function measures how wrong the model’s predictions are. Training is the process of minimizing this function. Different problems require different loss functions:
Mean Squared Error (MSE): for regression problems. Measures the average squared difference between predicted and actual values. Penalizes large errors heavily.
Cross-entropy loss: for classification problems. Measures how different the predicted probability distribution is from the true distribution. Produces larger gradients when the model is confidently wrong, driving faster learning.
The choice of loss function determines what the model optimizes for. A model trained with MSE will behave differently from one trained with mean absolute error — MSE cares more about outliers because of the squaring.
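The difference between these losses shows up on a tiny example. The sketch below uses made-up numbers: four regression targets with one badly missed prediction, and two classification outputs for a true class at index 0. The cross-entropy helper assumes a one-hot true label.

```python
import math

def mse(predictions, targets):
    """Mean squared error: average of squared differences."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def mae(predictions, targets):
    """Mean absolute error: average of absolute differences."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)

def cross_entropy(probabilities, true_index):
    """Cross-entropy against a one-hot label: -log(probability of true class)."""
    return -math.log(probabilities[true_index])

targets = [1, 2, 3, 4]
predictions = [1, 2, 3, 14]  # three exact predictions, one outlier error of 10

print(mse(predictions, targets))  # 25.0: the squared outlier dominates
print(mae(predictions, targets))  # 2.5

# A confidently wrong prediction is penalized far more than a hesitant one.
print(cross_entropy([0.01, 0.99], true_index=0))
print(cross_entropy([0.40, 0.60], true_index=0))
```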
Visualize the loss as a surface over the space of all possible parameter values. Training is navigating this surface to find a low point. For simple models, the surface is a smooth bowl with a single minimum. For deep networks, the surface is a high-dimensional terrain with ridges, valleys, saddle points, and many local minima. Understanding this landscape is key to understanding why some training strategies work and others fail.
Gradient descent is the optimization algorithm that adjusts parameters to minimize the loss. The gradient of the loss function points in the direction of steepest increase. Move in the opposite direction — downhill — and the loss decreases.
Learning rate: how big a step to take. Too large: overshoot the minimum, oscillate, or diverge. Too small: training takes forever and may get stuck in local minima. This single hyperparameter is often the most important tuning decision.
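The effect of the learning rate is easy to observe on a one-parameter problem. This sketch minimizes the toy loss (w - 3)^2, chosen here purely for illustration. Its gradient is 2(w - 3), so each step multiplies the distance to the minimum by (1 - 2 * lr).

```python
def grad(w):
    """Gradient of the toy loss (w - 3)^2."""
    return 2.0 * (w - 3.0)

def gradient_descent(lr, steps=50, w=0.0):
    """Repeatedly step against the gradient and return the final parameter."""
    for _ in range(steps):
        w -= lr * grad(w)
    return w

print(gradient_descent(lr=0.1))    # converges close to 3
print(gradient_descent(lr=0.001))  # too small: barely moved after 50 steps
print(gradient_descent(lr=1.1))    # too large: each step overshoots further
```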
Stochastic gradient descent (SGD): compute the gradient on a small random batch of training data (a mini-batch) instead of the full dataset. This is noisier but much faster — and the noise actually helps escape shallow local minima. Virtually all neural network training uses mini-batch SGD or its variants.
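A minimal sketch of mini-batch SGD, fitting a line to synthetic data. Everything about the setup is an assumption for illustration: the data follow y = 2x + 1 with a little Gaussian noise, and the batch size, learning rate, and epoch count are arbitrary choices that happen to work for this problem.

```python
import random

def make_data(n=200, seed=0):
    """Synthetic points from y = 2x + 1 plus small noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, 2 * x + 1 + rng.gauss(0, 0.05)) for x in xs]

def sgd_linear_regression(data, lr=0.1, batch_size=16, epochs=50, seed=0):
    """Fit y = w*x + b by minimizing MSE with mini-batch SGD."""
    rng = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(data)  # a new random batch order every epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the batch MSE with respect to w and b.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= lr * grad_w
            b -= lr * grad_b
    return w, b

w, b = sgd_linear_regression(make_data())
print(w, b)  # should land near the true values 2 and 1
```

Each update sees only 16 points, so the gradient is a noisy estimate of the full-dataset gradient, yet the parameters still settle near the true values.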
Adam: one of the most popular optimizers in practice. Combines momentum (keep moving in the direction of recent progress) with adaptive learning rates (adjust the step size per-parameter based on the history of gradients). Adam often converges faster than vanilla SGD and usually needs less learning-rate tuning.
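The two ingredients of Adam can be seen in a single-parameter sketch. The update below follows the standard Adam rule with its usual default decay rates. The toy loss (w - 3)^2 and the learning rate of 0.1 are assumptions made for this example.

```python
import math

def adam_minimize(grad_fn, w=0.0, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=500):
    """Minimize a one-parameter function with the Adam update rule."""
    m = 0.0  # running average of gradients (momentum)
    v = 0.0  # running average of squared gradients (per-parameter scale)
    for t in range(1, steps + 1):
        g = grad_fn(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        # Bias correction: both averages start at zero and are too small early on.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w

def grad(w):
    """Gradient of the toy loss (w - 3)^2."""
    return 2.0 * (w - 3.0)

print(adam_minimize(grad, steps=1))  # first step has size ~lr, whatever the gradient scale
print(adam_minimize(grad))           # settles near the minimum at 3
```

Because the step is the ratio of the two running averages, its size depends on the learning rate rather than on the raw gradient magnitude, which is the adaptive part.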
A constant learning rate is rarely optimal. Common strategies: start large (explore broadly) and decay over time (settle into a minimum). Warmup: start very small, increase gradually, then decay; this stabilizes training for very large models. Cosine annealing: decay the learning rate along a cosine curve from its peak toward a minimum. The warm-restart variant periodically resets the rate to a high value, which can help escape poor local minima.
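One common way to combine these ideas is linear warmup followed by cosine decay. The sketch below is one possible formulation. The function name `lr_schedule` and the default values for the peak rate, warmup length, and minimum rate are illustrative assumptions.

```python
import math

def lr_schedule(step, total_steps, peak_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warmup: grow linearly from a small value up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, from 0 to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in (0, 50, 99, 100, 550, 1000):
    print(step, lr_schedule(step, total_steps=1000))
```

Halfway through the decay phase the rate sits at the midpoint between the peak and the minimum, and it flattens out smoothly at both ends of the cosine.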
Backpropagation is the algorithm that computes how much each parameter contributed to the loss. It applies the chain rule of calculus repeatedly, propagating error signals backward from the output layer to the input layer.
Forward pass: input flows through the network, producing a prediction. The loss function computes the error. Backward pass: the error signal flows backward. At each layer, backpropagation computes two things — the gradient of the loss with respect to that layer’s parameters (used to update the parameters) and the gradient with respect to that layer’s input (passed to the previous layer).
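This is computationally efficient: each pass visits each connection exactly once, so computing every gradient costs roughly as much as one forward pass. Without backpropagation, estimating gradients one parameter at a time would need a separate forward pass per parameter, which is intractable for a network with millions of parameters.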
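The forward and backward passes can be written out by hand for a very small network. The sketch below assumes a network with one input, one tanh hidden unit, one linear output, and a squared-error loss, all chosen for illustration. It also checks the result against finite differences, which need two extra forward passes per parameter.

```python
import math

def forward(params, x, target):
    """Forward pass: returns the loss and the intermediate values."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1          # hidden pre-activation
    h = math.tanh(z)         # hidden activation
    y = w2 * h + b2          # prediction
    loss = (y - target) ** 2
    return loss, (z, h, y)

def backward(params, x, target):
    """Backward pass: chain rule from the loss back to every parameter."""
    w1, b1, w2, b2 = params
    _, (z, h, y) = forward(params, x, target)
    dloss_dy = 2 * (y - target)
    # Output layer: gradients for its parameters and for its input h.
    grad_w2 = dloss_dy * h
    grad_b2 = dloss_dy
    dloss_dh = dloss_dy * w2           # signal passed to the previous layer
    # Hidden layer: tanh'(z) = 1 - tanh(z)^2.
    dloss_dz = dloss_dh * (1 - h ** 2)
    grad_w1 = dloss_dz * x
    grad_b1 = dloss_dz
    return [grad_w1, grad_b1, grad_w2, grad_b2]

def numerical_gradient(params, x, target, eps=1e-6):
    """Central finite differences: two forward passes per parameter."""
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((forward(up, x, target)[0] - forward(down, x, target)[0]) / (2 * eps))
    return grads

params = [0.5, -0.2, 1.5, 0.1]
print(backward(params, x=0.8, target=1.0))
print(numerical_gradient(params, x=0.8, target=1.0))
```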
In practice, nobody implements backpropagation by hand. Frameworks like PyTorch and JAX use automatic differentiation — they build a computational graph during the forward pass and mechanically apply the chain rule to compute all gradients. This means defining the forward computation is enough, and the framework generates the backward pass automatically. This is what makes rapid experimentation with novel architectures possible.
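The idea of recording a graph during the forward pass and replaying it backward fits in a toy scalar implementation. The `Value` class below is a hypothetical teaching sketch written for this example. It is not how PyTorch or JAX are implemented, and it supports only addition, multiplication, and tanh.

```python
import math

class Value:
    """A scalar that remembers how it was computed."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # inputs this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1 - t * t,))

    def backward(self):
        """Apply the chain rule over the recorded graph, output to inputs."""
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)  # parents are appended before children

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad

# Only the forward computation is written out; gradients follow mechanically.
x, target = 0.8, 1.0
w1, b1, w2, b2 = Value(0.5), Value(-0.2), Value(1.5), Value(0.1)
hidden = (w1 * x + b1).tanh()
prediction = hidden * w2 + b2
error = prediction + (-target)
loss = error * error
loss.backward()
print(w1.grad, b1.grad, w2.grad, b2.grad)
```

Changing the architecture means changing only the forward lines at the bottom. The `backward` method needs no edits, which is the property that makes experimentation cheap.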
A neural network with a single hidden layer of sufficient width can approximate any continuous function on a compact domain to arbitrary precision. This theorem, first proven in 1989 for sigmoid activations and later extended to other nonlinearities, establishes that neural networks are theoretically powerful enough for any continuous problem.
But it says nothing about how to find the right parameters (training might fail), how many neurons are needed (possibly an impractical number), or whether the network will generalize (it might just memorize). The theorem guarantees existence, not feasibility. In practice, deep networks with multiple layers work better than wide shallow ones — they learn more efficiently and generalize better.
Deep learning has a theory gap. Practitioners know empirically that deeper networks train faster, generalize better, and require fewer parameters than theory predicts. The loss landscape of overparameterized networks is smoother than expected — there are few bad local minima. SGD acts as an implicit regularizer, biasing toward simpler solutions. These observations are empirically robust but theoretically not fully understood. Deep learning works better than current theory says it should.
This lesson establishes: