LLMs Check

Covers: pretraining, fine-tuning, prompt-engineering, rag, agents-tools

A lab has a fixed compute budget and must choose between training a 130B parameter model on 260B tokens or a 65B parameter model on 1.3T tokens. According to the Chinchilla scaling findings, which approach produces a better model and why?
The Chinchilla paper demonstrated that, for a fixed compute budget, the lowest loss comes from scaling model size and training data together rather than maximizing parameter count. The compute-optimal ratio is approximately 20 tokens per parameter. The 130B model trained on 260B tokens (2 tokens per parameter) sees only one tenth of the tokens that ratio calls for, so it is heavily under-trained. The 65B model trained on 1.3T tokens (20 tokens per parameter) matches the Chinchilla optimum and is expected to reach lower loss despite having fewer parameters.
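The ratios in the answer above can be checked with a few lines of arithmetic. This is a minimal sketch: the helper names are illustrative, and the training-cost estimate relies on the common approximation of about 6 FLOPs per parameter per token, which is an assumption not stated in the question. Under that approximation the two options come out at roughly 2.0e23 and 5.1e23 FLOPs, so they are only loosely "equal compute"; the tokens-per-parameter comparison is what drives the answer.

```python
# Approximate Chinchilla ratio cited in the answer.
CHINCHILLA_TOKENS_PER_PARAM = 20
# Assumption: training cost is roughly 6 FLOPs per parameter per token.
FLOPS_PER_PARAM_TOKEN = 6


def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def approx_train_flops(params: float, tokens: float) -> float:
    """Rough training cost under the 6 * N * D approximation."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


options = {
    "130B params, 260B tokens": (130e9, 260e9),
    "65B params, 1.3T tokens": (65e9, 1.3e12),
}

for name, (n_params, n_tokens) in options.items():
    ratio = tokens_per_param(n_params, n_tokens)
    share = ratio / CHINCHILLA_TOKENS_PER_PARAM
    flops = approx_train_flops(n_params, n_tokens)
    print(f"{name}: {ratio:.0f} tokens/param "
          f"({share:.0%} of the ~20 target), ~{flops:.2e} FLOPs")
```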
A team is choosing between RLHF and DPO to align a fine-tuned model. They have a dataset of 50K human preference pairs (response A preferred over response B) but limited engineering resources. Which approach is more practical and why?
DPO optimizes the policy directly on preference data, skipping the two-stage process of training a reward model and then running PPO. This makes it significantly simpler to implement, typically more stable during training, and less demanding of engineering resources. RLHF can outperform DPO in some scenarios with a well-tuned reward model, but for a resource-constrained team with a standard preference dataset, DPO is the practical choice. A dataset of 50K preference pairs is generally enough for DPO training.
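To make the "no reward model, no PPO" point concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the summed log-probabilities of each response under the policy and under a frozen reference model have already been computed; the function name, argument names, and the default `beta=0.1` are illustrative choices, not values from the question.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single (chosen, rejected) preference pair."""
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

The whole objective is a single supervised-style loss over preference pairs, which is why it needs far less infrastructure than a reward model plus an RL loop.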
A model answers "What is the capital of Australia?" correctly when asked directly (zero-shot) but fails a multi-step problem: "If a train travels 60 km/h for 2.5 hours, then 80 km/h for 1.5 hours, what is the total distance?" Which prompting technique is most likely to fix the multi-step failure?
Chain-of-thought (CoT) prompting addresses multi-step reasoning failures by having the model generate intermediate steps as tokens. Each step (60 * 2.5 = 150, then 80 * 1.5 = 120, then 150 + 120 = 270) becomes part of the context for the next step. Without CoT, the model must compute the entire answer in a single forward pass, which is unreliable for multi-step arithmetic. Few-shot examples of unrelated tasks do not help. Temperature affects randomness, not reasoning capability. Context window size is not the bottleneck: the model needs to show its work, not have more space.
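The sketch below shows one way to request intermediate steps, plus the arithmetic a correct chain of thought should reproduce. The prompt wording is one common zero-shot CoT phrasing used here as an assumption, and both function names are hypothetical; no model is called.

```python
def build_cot_prompt(question: str) -> str:
    """Wrap a question so the model is nudged to write out its steps."""
    return f"Q: {question}\nA: Let's think step by step."


def total_distance(legs):
    """Return the intermediate steps and the total for (speed, hours) legs."""
    steps = []
    total = 0.0
    for speed_kmh, hours in legs:
        distance = speed_kmh * hours
        steps.append(f"{speed_kmh} km/h * {hours} h = {distance} km")
        total += distance
    steps.append(f"total = {total} km")
    return steps, total


question = ("If a train travels 60 km/h for 2.5 hours, then 80 km/h for "
            "1.5 hours, what is the total distance?")
print(build_cot_prompt(question))

# The reference steps a chain-of-thought answer should contain.
steps, total = total_distance([(60, 2.5), (80, 1.5)])
print("\n".join(steps))
```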
A RAG system retrieves chunks using semantic search, but users report that searches for specific error codes (e.g., "ERR-4092") return irrelevant results about error handling in general. What architectural change would most directly fix this problem?
Semantic search matches on meaning, so "ERR-4092" and "error handling" land close together in embedding space even though they refer to different things. BM25 keyword search matches the exact string "ERR-4092" and will find documents containing that specific code. Hybrid search combines both methods: keyword search handles exact-match queries while semantic search handles conceptual queries. Larger chunks, more chunks, or fine-tuning the embedding model do not address the fundamental mismatch between semantic similarity and exact string matching.
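A minimal sketch of the hybrid idea follows. It uses a small from-scratch BM25 scorer and reciprocal rank fusion (RRF) as the combiner; RRF is one common fusion choice, not something the question prescribes. The documents, the tokenizer, and the hard-coded `semantic_ranking` (a stand-in for what an embedding search might return) are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Keep hyphenated codes such as "ERR-4092" together as one token.
    return re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower())


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Return indices of docs that match the query, best BM25 score first."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    n_docs = len(docs)
    scores = {}
    for i, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:  # documents with no keyword match are left out
            scores[i] = score
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one ranking."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


docs = [
    "General guidance on error handling and retry logic.",
    "ERR-4092: the upload failed because the auth token expired.",
    "How to configure logging levels.",
]
# Hypothetical embedding-search result that favors the general article.
semantic_ranking = [0, 1, 2]
keyword_ranking = bm25_rank("ERR-4092", docs)
hybrid_ranking = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(hybrid_ranking)
```

Because the exact-match document appears in both lists, fusion lifts it above the general article that only the semantic ranking favored.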
A coding agent must fix a failing test. It reads the error, edits the code, runs the test, sees a new error, edits again, and runs the test again — succeeding on the second iteration. Each step in the agent loop succeeds with 90% probability. If the task requires 6 steps, what is the approximate probability the agent completes the task without any step failing?
When steps are independent, end-to-end reliability is the product of the per-step success probabilities. With 6 independent steps each succeeding at 90%, the overall probability is 0.9^6 ≈ 0.531, or approximately 53%. This demonstrates why agent reliability is a critical research area: even a high per-step success rate produces surprisingly low end-to-end reliability for multi-step tasks. Improving per-step reliability from 90% to 95% would raise the 6-step success rate from 53% to 74% (0.95^6 ≈ 0.735).
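The compounding effect is easy to reproduce. This short sketch assumes, as the answer does, that steps fail independently and that the task has no retries; the function name is illustrative.

```python
def task_success_probability(per_step_success: float, num_steps: int) -> float:
    """Probability that every one of num_steps independent steps succeeds."""
    return per_step_success ** num_steps


for p in (0.90, 0.95, 0.99):
    overall = task_success_probability(p, 6)
    print(f"per-step {p:.0%} over 6 steps -> {overall:.1%} end-to-end")
```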