A pretrained model is a next-token predictor. It completes text. Ask it a question and it may generate more questions, or continue in the style of a web forum, or produce anything statistically likely to follow the prompt. Alignment transforms this raw predictor into a model that follows instructions, refuses harmful requests, and produces useful responses.
Supervised fine-tuning (SFT) is the first step after pretraining. Human contractors write thousands of example conversations in which a user asks a question and the assistant provides a helpful response. The model trains on these demonstrations, learning the input-output format of an assistant rather than a document completer.
This is instruction tuning — the model learns to follow instructions rather than simply predict what text comes next. The training data is small relative to pretraining (tens of thousands of examples vs. trillions of tokens), but it fundamentally changes the model’s behavior. After SFT, the model responds to questions with answers instead of continuations.
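As a minimal sketch of what training on demonstrations amounts to, the snippet below computes the next-token negative log-likelihood over the assistant's tokens only. Excluding prompt tokens from the loss is a common convention but an assumption here; the probabilities are made-up values and `sft_loss` is a hypothetical helper name.

```python
import math

def sft_loss(token_probs, is_assistant_token):
    """Mean negative log-likelihood over assistant tokens only.

    token_probs[i]: probability the model assigned to the i-th target token.
    is_assistant_token[i]: True if that token is part of the demonstrated
    assistant reply. Prompt tokens are skipped (an assumed convention).
    """
    losses = [
        -math.log(p)
        for p, keep in zip(token_probs, is_assistant_token)
        if keep
    ]
    if not losses:
        raise ValueError("no assistant tokens to train on")
    return sum(losses) / len(losses)

# Illustrative values: two prompt tokens followed by three reply tokens.
probs = [0.9, 0.8, 0.5, 0.25, 0.5]
mask = [False, False, True, True, True]
loss = sft_loss(probs, mask)
```

Minimizing this quantity pushes probability mass toward the demonstrated reply, which is the same next-token objective as pretraining applied to a much smaller, curated dataset.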
SFT quality is bounded by the quality of the demonstrations. Early instruction-tuning datasets were written by crowd workers with variable expertise. The shift toward expert-written demonstrations (domain specialists, experienced writers) produced measurably better models. InstructGPT showed that human raters preferred outputs from a 1.3B model, fine-tuned on demonstrations and then on human feedback, over those of the 175B GPT-3 base model. For SFT, data quality matters more than data quantity.
SFT teaches the model what a good response looks like by example. Reinforcement learning from human feedback (RLHF) teaches the model what humans prefer by comparison. The process has two stages.
Reward model training: human raters compare pairs of model outputs for the same prompt and indicate which response is better. These preference judgments train a reward model — a neural network that scores any response on a quality scale.
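One common way to turn pairwise judgments into a training signal is the Bradley-Terry loss, sketched below. The source does not name a specific loss, so treat this as an assumption; the scores are illustrative numbers standing in for reward-model outputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(score_chosen, score_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    Small when the reward model already scores the human-preferred
    response higher, large when it ranks the pair the wrong way round.
    """
    return -math.log(sigmoid(score_chosen - score_rejected))

agrees_with_rater = preference_loss(2.0, 0.0)     # model ranks the pair correctly
disagrees_with_rater = preference_loss(0.0, 2.0)  # model ranks the pair incorrectly
undecided = preference_loss(1.0, 1.0)             # model scores both the same
```

Only the difference between the two scores matters, so the reward scale is arbitrary up to a constant offset.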
Policy optimization: the LLM generates responses, the reward model scores them, and the LLM is updated via Proximal Policy Optimization (PPO) to produce higher-scoring outputs. A KL divergence penalty keeps the model from drifting too far from the SFT baseline, which limits reward hacking: producing outputs that score high on the reward model but are actually low quality.
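The effect of the KL penalty can be illustrated with a small sketch. It shows only the penalized reward that the policy update would maximize, not PPO itself; the coefficient `beta`, the function name, and the log-probabilities are assumptions chosen for illustration.

```python
def kl_penalized_reward(reward_score, logp_policy, logp_reference, beta=0.1):
    """Reward-model score minus beta times a per-sequence KL estimate.

    logp_policy / logp_reference: log-probabilities of each sampled token
    under the current policy and under the frozen SFT model.
    """
    kl_estimate = sum(lp - lr for lp, lr in zip(logp_policy, logp_reference))
    return reward_score - beta * kl_estimate

# Policy identical to the SFT model: no penalty is applied.
no_drift = kl_penalized_reward(1.5, [-1.0, -2.0], [-1.0, -2.0])

# Policy far more confident than the SFT model on the same tokens:
# the high reward-model score is discounted.
drifted = kl_penalized_reward(1.5, [-0.1, -0.2], [-1.1, -2.2])
```

A larger `beta` keeps the policy closer to the SFT baseline at the cost of smaller gains in reward-model score.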
It is easier for humans to compare two responses than to write a perfect response from scratch. Comparison judgments scale better (faster, cheaper, more consistent across raters) and capture subtle preferences that are difficult to specify in a demonstration. A rater may not be able to write an expert-level physics explanation, but they can reliably identify which of two explanations is clearer, more accurate, and more complete.
RLHF also allows the model to explore the space of possible responses and find outputs that score well, rather than being constrained to imitate fixed demonstrations. The model can discover response patterns that no human demonstrator wrote.
Direct Preference Optimization (DPO, 2023) targets the same objective as RLHF without training a separate reward model or running PPO. The key insight: the optimal policy under the KL-constrained RLHF objective has a closed-form relationship with the reward function, so the reward can be rewritten in terms of the policy itself. This means the policy can be optimized directly on the preference data.
Given a pair of responses where response A is preferred over response B, DPO increases the probability of A and decreases the probability of B, weighted by how surprising the preference is under the current model. No reward model, no reinforcement learning loop, no PPO hyperparameter tuning.
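A minimal sketch of the per-pair DPO loss follows, assuming sequence-level log-probabilities from the current policy and a frozen reference model are already available. The numbers and the `beta` value are illustrative, and `dpo_loss` is a hypothetical helper name.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(logp_a, logp_b, ref_logp_a, ref_logp_b, beta=0.1):
    """DPO loss for one pair where response A is preferred over B.

    The margin compares how much the policy has raised A relative to the
    reference model against how much it has raised B.
    Returns (loss, gradient_weight).
    """
    margin = beta * ((logp_a - ref_logp_a) - (logp_b - ref_logp_b))
    loss = -math.log(sigmoid(margin))
    # Scale of the gradient: near 1 when the policy ranks the pair wrongly
    # (a "surprising" preference), near 0 when it already agrees.
    gradient_weight = sigmoid(-margin)
    return loss, gradient_weight

# Policy equals the reference: margin is 0, weight is 0.5.
start_loss, start_weight = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Policy has shifted toward the dispreferred response B: larger weight.
wrong_loss, wrong_weight = dpo_loss(-15.0, -8.0, -10.0, -12.0)

# Policy has shifted toward the preferred response A: smaller weight.
right_loss, right_weight = dpo_loss(-8.0, -15.0, -10.0, -12.0)
```

The reference log-probabilities play the role of the KL anchor in RLHF: the loss rewards moving away from the reference in the preferred direction, scaled by `beta`.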
DPO is simpler to implement, more stable to train, and requires less compute. Many labs have adopted it in place of PPO-based RLHF for preference optimization. However, RLHF with a well-trained reward model can still outperform DPO on some tasks, particularly when the reward model captures preferences that are difficult to express as pairwise comparisons. Some labs use both: DPO for initial alignment, then RLHF for fine-grained optimization on specific capabilities.
Constitutional AI is Anthropic’s approach to alignment. Instead of relying solely on human raters, the model critiques and revises its own outputs according to a set of principles (a “constitution”). The process:
This reduces dependence on human raters for safety-related judgments and makes the alignment criteria explicit and auditable.
Constitutional AI addresses a fundamental challenge: as models become more capable, human raters struggle to evaluate outputs in domains where the model knows more than the rater. A human may not be able to tell whether a complex code snippet has a subtle security vulnerability. Constitutional AI partially addresses this by having the model itself participate in evaluation, though this introduces the risk of the model’s blind spots becoming self-reinforcing.
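The critique-and-revise idea described above can be sketched as a loop over principles. Everything here is hypothetical: `model` stands for any text-in, text-out callable, the prompt wording is invented for illustration, and the stub below only records calls so the example runs without an actual LLM.

```python
from typing import Callable, List

def critique_and_revise(
    prompt: str,
    principles: List[str],
    model: Callable[[str], str],
) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = model(prompt)
    for principle in principles:
        critique = model(
            "Critique the response using this principle.\n"
            f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}"
        )
        response = model(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nPrompt: {prompt}\nResponse: {response}"
        )
    return response

# Stand-in "model" that only logs what it was asked; swap in a real LLM call.
calls: List[str] = []

def stub_model(text: str) -> str:
    calls.append(text)
    return f"output-{len(calls)}"

final = critique_and_revise(
    "Explain photosynthesis.",
    ["Be accurate.", "Be concise."],
    stub_model,
)
```

Because the same model writes both the draft and the critique, any error it cannot recognize passes through the loop unchanged, which is the self-reinforcing blind-spot risk noted above.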
Full fine-tuning updates every parameter in the model: billions of weights. LoRA (Low-Rank Adaptation) freezes the pretrained weights and injects small trainable rank-decomposition matrices into each transformer layer. Instead of updating a d × k weight matrix W directly, LoRA learns two small matrices, B (d × r) and A (r × k), such that the effective update is ΔW = BA, where the rank r is much smaller than d and k.
A 70B parameter model might require only 100M trainable parameters with LoRA, a 700x reduction. Because gradients and optimizer state are kept only for the adapter weights, this can make fine-tuning feasible on a single GPU, and it enables serving multiple fine-tuned variants from the same base model by swapping lightweight adapters.
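To make the parameter savings concrete, here is a small sketch of the count for a single weight matrix and of the adapted forward pass. The 4096 × 4096 shape and rank 8 are illustrative assumptions, not figures from any particular model, and the tiny matrices exist only to show the arithmetic.

```python
def lora_param_counts(d_out, d_in, rank):
    """Trainable parameters for a full update vs. a rank-`rank` LoRA update."""
    full = d_out * d_in             # every entry of W
    lora = rank * (d_out + d_in)    # B is d_out x rank, A is rank x d_in
    return full, lora

def matvec(matrix, vector):
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x):
    """y = W x + B (A x); W stays frozen, only A and B would be trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

full, lora = lora_param_counts(4096, 4096, 8)
reduction = full // lora

# Tiny worked example with rank 1.
W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2 x 2 weight
A = [[1.0, 1.0]]               # 1 x 2
B = [[0.5], [0.0]]             # 2 x 1
y = lora_forward(W, A, B, [2.0, 3.0])
```

Computing `B (A x)` instead of forming `BA` first keeps the extra cost proportional to the rank; after training, `BA` can be added into `W` once so inference pays no overhead.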
Fine-tuning (including LoRA) is appropriate when a model’s behavior must change consistently across many inputs: a specific output format, domain-specific knowledge, a particular tone or style. Prompting is appropriate when the desired behavior can be described in the context window. The decision tree: try prompting first. If prompting works but is unreliable or too expensive (long system prompts consume tokens), fine-tune. If the task requires knowledge not in the base model, fine-tune or use retrieval-augmented generation (RAG).
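One possible reading of this decision tree, written as a function. The argument names and the order of the checks are assumptions; in particular, the text does not say what to do when prompting fails outright, and this sketch falls back to fine-tuning in that case.

```python
def choose_adaptation(prompting_works, reliable_enough, cheap_enough,
                      needs_new_knowledge):
    """Return a suggested approach for adapting a model to a task."""
    if needs_new_knowledge:
        # The base model lacks the facts: add them via training or retrieval.
        return "fine-tune or RAG"
    if prompting_works and reliable_enough and cheap_enough:
        return "prompting"
    # Prompting is unreliable, too expensive, or (assumed) does not work at all.
    return "fine-tune"

simple_case = choose_adaptation(True, True, True, False)
flaky_prompt = choose_adaptation(True, False, True, False)
missing_facts = choose_adaptation(True, True, True, True)
```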
Alignment improves safety and helpfulness but can reduce raw capability. A model trained to refuse harmful requests will sometimes refuse benign requests that superficially resemble harmful ones (over-refusal). A model trained to be concise may underperform on tasks requiring detailed analysis. A model trained to express uncertainty may hedge on questions it actually knows the answer to.
This tradeoff — the alignment tax — is a central tension in LLM development. Too little alignment produces a capable but dangerous model. Too much alignment produces a safe but frustrating model. Finding the right balance is an ongoing research problem.
Labs track both capability benchmarks (MMLU, HumanEval, math competitions) and safety benchmarks (refusal rates on harmful prompts, over-refusal rates on benign prompts) across alignment stages. The goal is a Pareto improvement: better safety without capability regression. In practice, alignment stages often trade small capability decrements for large safety improvements, so the capability cost is small but not zero.
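A Pareto check across alignment stages might look like the sketch below. The metric names and scores are invented for illustration and are not measurements from any real model.

```python
def is_pareto_improvement(before, after, higher_is_better):
    """True if no tracked metric got worse and at least one got better."""
    gains = []
    for name, better_when_higher in higher_is_better.items():
        change = after[name] - before[name]
        gains.append(change if better_when_higher else -change)
    return all(g >= 0 for g in gains) and any(g > 0 for g in gains)

# Direction of "better" for each metric (assumed metric names).
directions = {
    "mmlu": True,                   # capability: higher is better
    "harmful_refusal_rate": True,   # safety: higher is better
    "over_refusal_rate": False,     # helpfulness: lower is better
}

sft_stage = {"mmlu": 0.70, "harmful_refusal_rate": 0.80, "over_refusal_rate": 0.05}
ideal = {"mmlu": 0.70, "harmful_refusal_rate": 0.95, "over_refusal_rate": 0.05}
typical = {"mmlu": 0.69, "harmful_refusal_rate": 0.95, "over_refusal_rate": 0.05}

ideal_is_pareto = is_pareto_improvement(sft_stage, ideal, directions)
typical_is_pareto = is_pareto_improvement(sft_stage, typical, directions)
```

The `typical` case mirrors the pattern described above: a large safety gain paid for with a small capability drop, which fails the strict Pareto test.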
This lesson establishes:
Next: Prompt Engineering