Traditional software follows explicit rules: if this, then that. Machine learning inverts the process — examples are provided, and the system discovers the rules. This distinction is the dividing line between classical programming and modern AI.
Classical programming: rules are written that map inputs to outputs. The programmer encodes domain knowledge directly.
Machine learning: input-output pairs are provided, and an algorithm finds the mapping function automatically. The “learning” is the process of adjusting internal parameters until the function approximates the true relationship in the data.
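The contrast can be sketched in a few lines of Python. The temperature-conversion task, the function names `rule_based_c_to_f` and `fit_line`, and the six example pairs are illustrative assumptions rather than part of the lesson. Closed-form least squares stands in here for the iterative parameter adjustment described above.

```python
# Classical programming: the programmer writes the rule directly.
def rule_based_c_to_f(celsius):
    return celsius * 9 / 5 + 32


# Machine learning: the rule (slope and intercept) is recovered from examples.
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical input-output pairs (Celsius, Fahrenheit).
examples_c = [-40.0, 0.0, 10.0, 25.0, 37.0, 100.0]
examples_f = [-40.0, 32.0, 50.0, 77.0, 98.6, 212.0]

slope, intercept = fit_line(examples_c, examples_f)


def learned_c_to_f(celsius):
    return slope * celsius + intercept
```

With these examples the fitted slope and intercept land very close to 1.8 and 32, the same rule the programmer wrote by hand, but found from data instead of encoded from domain knowledge.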
This is not intelligence in the human sense. It is function approximation at scale. The power comes from the fact that many real-world problems (speech recognition, image classification, language translation) are too complex for humans to capture in explicit rules, yet can be solved by learning from enough examples.
In 2009, Google researchers Alon Halevy, Peter Norvig, and Fernando Pereira published “The Unreasonable Effectiveness of Data,” arguing that simple algorithms trained on massive datasets outperform sophisticated algorithms trained on small ones. This insight, that scale often matters more than cleverness, foreshadowed the scaling laws that drive modern LLM development. The best algorithm with 1,000 examples often loses to a mediocre algorithm with 10 million examples.
Supervised learning: the training data includes both inputs and correct outputs (labels). The model learns to predict the output given new inputs. Examples: spam detection (email -> spam/not spam), house price prediction (features -> price), image classification (pixels -> label). This is the workhorse of applied ML.
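As a minimal sketch of the spam-detection case, the following toy classifier learns word counts from labeled emails and predicts the label whose training emails used the words most often. The four training emails, the scoring rule, and the names `train` and `predict` are assumptions made for illustration; a practical spam filter would use far more data and a better-founded model.

```python
from collections import Counter


def train(labeled_emails):
    """Count how often each word appears under each label."""
    counts = {}
    for text, label in labeled_emails:
        counts.setdefault(label, Counter()).update(text.lower().split())
    return counts


def predict(counts, text):
    """Return the label whose training emails contained these words most often."""
    words = text.lower().split()
    scores = {label: sum(c[w] for w in words) for label, c in counts.items()}
    return max(scores, key=scores.get)


# Hypothetical labeled training data: (input, correct output).
training_data = [
    ("win cash now", "spam"),
    ("free prize claim now", "spam"),
    ("meeting at noon", "not spam"),
    ("lunch tomorrow at noon", "not spam"),
]

model = train(training_data)
label_a = predict(model, "claim free cash")        # words seen only in spam
label_b = predict(model, "noon meeting tomorrow")  # words seen only in non-spam
```

The labels do the work here: the code never states what makes an email spam, it only counts what the labeled examples show.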
Unsupervised learning: the training data has no labels. The model discovers structure on its own. Examples: clustering customers by purchasing behavior, dimensionality reduction for visualization, anomaly detection. The model finds patterns humans did not explicitly specify.
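The customer-clustering example can be sketched with a one-dimensional k-means. The spend figures, the function name `kmeans_1d`, and the choice to start the two centroids at the minimum and maximum value are assumptions for illustration; library implementations handle initialization and higher dimensions more carefully.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer; no labels are provided.
monthly_spend = [10, 12, 11, 95, 100, 105]
centroids, clusters = kmeans_1d(
    monthly_spend, centroids=[min(monthly_spend), max(monthly_spend)]
)
```

No one told the algorithm that there are low spenders and high spenders; the two groups fall out of the distances between the points.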
Reinforcement learning: an agent takes actions in an environment and receives rewards or penalties. The model learns a policy (which action to take in each state) by maximizing cumulative reward. Examples: game playing (AlphaGo), robotics, recommendation systems. The training signal is often sparse and delayed, which generally makes RL harder to train than supervised learning.
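A small sketch of the idea, using the tabular Q-learning update on a hypothetical five-cell corridor where only reaching the rightmost cell pays a reward. The environment, the constants `GAMMA` and `ALPHA`, and the deterministic sweep over every state-action pair (used in place of random exploration so the run is reproducible) are all assumptions for illustration.

```python
N_STATES = 5        # cells 0..4; cell 4 is the goal and ends the episode
ACTIONS = (-1, +1)  # move left or move right
GAMMA = 0.9         # discount factor for future reward
ALPHA = 0.5         # learning rate


def step(state, action):
    """Hypothetical environment: walls at both ends, reward only at the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


# Q-table: estimated cumulative reward for each (state, action) pair.
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(200):
    for s in range(N_STATES - 1):  # every non-terminal state
        for a in ACTIONS:
            s2, r, done = step(s, a)
            best_next = 0.0 if done else max(q[(s2, b)] for b in ACTIONS)
            target = r + GAMMA * best_next
            q[(s, a)] += ALPHA * (target - q[(s, a)])  # Q-learning update

# Greedy policy: in each state, take the action with the highest estimate.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)}
```

The reward is delayed: from cell 0 the agent sees zeros for three steps, and the value of moving right only emerges as the estimate propagates back from the goal, discounted by `GAMMA` at each step.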
Most real-world data is unlabeled — labeling is expensive. Semi-supervised learning uses a small amount of labeled data and a large amount of unlabeled data. Self-supervised learning creates labels from the data itself: mask a word in a sentence and predict it (BERT), or predict the next token (GPT). Self-supervised learning is the paradigm that enabled modern LLMs — the internet is an effectively unlimited source of self-supervised training data.
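How labels come "from the data itself" can be shown directly. The two helpers below are hypothetical names, and whitespace splitting stands in for a real tokenizer. The masking helper masks each position in turn for clarity, whereas BERT-style training masks a random subset of tokens.

```python
def next_token_pairs(tokens):
    """GPT-style objective: each prefix is an input, the following token is its label."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_examples(tokens, mask="[MASK]"):
    """BERT-style objective: hide one token, and the hidden token is the label."""
    return [
        (tokens[:i] + [mask] + tokens[i + 1:], tokens[i])
        for i in range(len(tokens))
    ]


tokens = "the cat sat".split()
next_token_data = next_token_pairs(tokens)
masked_data = masked_examples(tokens)
```

Three words of raw text yield two next-token examples and three masked examples without any human labeling, which is why unlabeled text scales so well as training data.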
A model that memorizes training data is useless — it needs to generalize to data it has never seen. The standard approach splits data into three sets:
Training set: the model learns from this data by adjusting its parameters.
Validation set: used during training to tune hyperparameters and detect overfitting. The model never trains on this data, but decisions are made based on validation performance.
Test set: touched exactly once, at the end, to estimate real-world performance. Peeking at the test set during development contaminates the estimate.
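A minimal sketch of the three-way split. The function name, the 70/15/15 proportions, and the fixed seed are assumptions; the right proportions depend on dataset size, and time-ordered data usually needs a chronological split instead of a shuffle.

```python
import random


def train_val_test_split(items, val_fraction=0.15, test_fraction=0.15, seed=0):
    """Shuffle once with a fixed seed, then cut into three disjoint sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_val = round(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


# Hypothetical dataset of 100 example IDs.
train, val, test = train_val_test_split(range(100))
```

Fixing the seed keeps the test set identical across runs, so no example quietly migrates from the test set into training between experiments.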
Overfitting: the model learns noise and idiosyncrasies in the training data instead of the true underlying pattern. It performs well on training data but poorly on new data. A model that memorizes every training example has overfit perfectly. Signs: training accuracy far exceeds validation accuracy.
Underfitting: the model is too simple to capture the underlying pattern. It performs poorly on both training and new data. A linear model trying to fit a quadratic relationship will underfit. Signs: both training and validation accuracy are poor.
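Both failure modes can be reproduced on synthetic data. In this sketch the data follow a noisy quadratic, a straight line plays the underfitting model, and a nearest-neighbor lookup that simply returns the closest memorized training label plays the overfitting model. The data generator, noise level, and all names are assumptions for illustration.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a dataset."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


rng = random.Random(0)


def noisy_quadratic(x):
    return x * x + rng.gauss(0, 0.5)  # true pattern plus noise


train_x = [-3 + 6 * i / 39 for i in range(40)]
val_x = [(a + b) / 2 for a, b in zip(train_x, train_x[1:])]  # unseen midpoints
train_y = [noisy_quadratic(x) for x in train_x]
val_y = [noisy_quadratic(x) for x in val_x]

slope, intercept = fit_line(train_x, train_y)


def linear(x):
    """Too simple: a straight line cannot follow a parabola."""
    return slope * x + intercept


def memorizer(x):
    """Too flexible: returns the label of the nearest training point, noise included."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


# (training error, validation error) for each model
results = {
    "linear": (mse(linear, train_x, train_y), mse(linear, val_x, val_y)),
    "memorizer": (mse(memorizer, train_x, train_y), mse(memorizer, val_x, val_y)),
}
```

The memorizer reaches zero training error yet makes errors on the validation points, which is the overfitting signature: a gap between training and validation performance. The line has large error on both sets, which is the underfitting signature.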
Under squared-error loss, every model’s expected prediction error decomposes into three components:
Bias: error from simplifying assumptions. A linear model has high bias when the true relationship is nonlinear — it systematically misses the pattern. High bias causes underfitting.
Variance: error from sensitivity to training data fluctuations. A very complex model fits different training sets very differently — it captures noise as if it were signal. High variance causes overfitting.
Irreducible error: noise inherent in the problem that no model can eliminate.
The tradeoff: reducing bias typically increases variance and vice versa. Simple models have high bias, low variance. Complex models have low bias, high variance. The goal is the sweet spot — complex enough to capture the real pattern, simple enough to ignore the noise.
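For squared-error loss, the three components can be written as one identity. The notation is an assumption introduced here: the data follow y = f(x) + ε with zero-mean noise of variance σ², \hat{f} is the model fitted on a randomly drawn training set, and the expectations are taken over training sets and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

The first term measures how far the average fitted model is from the truth, the second how much the fitted model moves when the training set changes, and the third is the noise floor that remains regardless of the model.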
Before deep learning, most of ML practice was feature engineering: transforming raw data into representations that help the model learn. Typical examples include converting dates to day-of-week, extracting edges from images, and computing rolling averages from time series. This was domain expertise encoded manually.
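Two of these transformations are short enough to sketch with the standard library. The function names and sample values are hypothetical.

```python
from datetime import date


def day_of_week(iso_date):
    """Turn a raw date string into a weekday index (Monday = 0, Sunday = 6)."""
    return date.fromisoformat(iso_date).weekday()


def rolling_mean(values, window):
    """Average of each run of `window` consecutive values in a time series."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]


weekday_feature = day_of_week("2024-01-01")            # 2024-01-01 was a Monday
smoothed_sales = rolling_mean([1, 2, 3, 4, 5], window=3)
```

A raw date string gives a model little to work with, while a weekday index exposes weekly patterns directly. That is the sense in which the engineer's domain knowledge is encoded in the feature.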
Deep learning largely automates feature engineering — neural networks learn representations directly from raw data. But feature engineering is not dead. In tabular data (most business applications), well-engineered features with gradient-boosted trees still often outperform deep learning.
ML works well when: there is a pattern to learn, sufficient data exists to learn it, the pattern is too complex for explicit rules, the problem tolerates probabilistic (not guaranteed) answers, and the training distribution matches the deployment distribution.
ML is the wrong tool when: the problem has clear, expressible rules (regular programming applies), the training data does not represent the real world (distribution shift), errors have catastrophic consequences with no human oversight, the model’s decisions cannot be explained and explainability is required, or there is too little data to learn a reliable pattern.
A model trained on data from one distribution and deployed on data from another can fail silently: it keeps producing confident but wrong answers. A fraud detection model trained on 2019 data has never seen 2020 pandemic-era spending patterns. A medical imaging model trained on one hospital’s equipment may not generalize to another’s. This is one of the most common failure modes in production ML, and standard metrics computed on held-out data from the training distribution rarely catch it.
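Because the model itself gives no warning, shift is usually caught by monitoring the inputs. The following is a deliberately crude sketch: it measures how far the mean of a live feature has moved from the training mean, in units of the training standard deviation. The function name, the sample values, and the alert threshold of 3 are assumptions; production monitoring typically compares full distributions across many features.

```python
import statistics


def mean_shift_score(train_values, live_values):
    """Distance between live and training means, in training standard deviations."""
    mu = statistics.fmean(train_values)
    sigma = statistics.stdev(train_values)
    return abs(statistics.fmean(live_values) - mu) / sigma


# Hypothetical transaction amounts seen during training.
train_amounts = [20, 22, 19, 21, 18, 20]

# Hypothetical live batches: one similar to training, one clearly different.
similar_batch = [21, 19, 20]
shifted_batch = [60, 65, 58]

ALERT_THRESHOLD = 3.0  # assumed cutoff, to be tuned per feature

similar_alert = mean_shift_score(train_amounts, similar_batch) > ALERT_THRESHOLD
shifted_alert = mean_shift_score(train_amounts, shifted_batch) > ALERT_THRESHOLD
```

A check like this needs no labels, which matters in deployment: the correct answers often arrive weeks later or never, while the inputs are available immediately.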
This lesson establishes: