LLMs Check
Covers: pretraining, fine-tuning, prompt-engineering, rag, agents-tools
A lab has a fixed compute budget and must choose between training a 130B parameter model on 260B tokens or a 65B parameter model on 1.3T tokens. According to the Chinchilla scaling findings, which approach produces a better model and why?
The 130B model, because larger models always outperform smaller ones at any training duration
The 65B model on 1.3T tokens, because Chinchilla showed compute-optimal training allocates roughly 20 tokens per parameter, and the 130B model is severely under-trained
Both produce equivalent models because total FLOPs are the same
The 130B model, because emergent abilities only appear above 100B parameters
The Chinchilla paper showed that, for a fixed compute budget, the lowest loss comes from scaling model size and training data together rather than maximizing parameter count. The compute-optimal ratio is approximately 20 tokens per parameter. The 130B model trained on 260B tokens gets only 2 tokens per parameter, a tenth of that ratio, so it is severely under-trained. The 65B model trained on 1.3T tokens (20 tokens per parameter) matches the Chinchilla optimum and will reach lower loss despite having fewer parameters.
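The ratio check in this explanation can be reproduced with a few lines of Python. This is a rough sketch, not a scaling-law fit: it assumes the 20 tokens-per-parameter rule of thumb and the common approximation that training FLOPs are about 6 * parameters * tokens. The function and variable names are illustrative.

```python
# Rough comparison of the two training runs from the question.
# Assumptions: ~20 tokens/parameter as the compute-optimal ratio, and
# training FLOPs approximated as 6 * parameters * tokens.

CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate rule of thumb


def tokens_per_param(params: float, tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return tokens / params


def approx_train_flops(params: float, tokens: float) -> float:
    """Approximate training compute under the 6 * N * D rule."""
    return 6 * params * tokens


runs = {
    "130B on 260B tokens": (130e9, 260e9),
    "65B on 1.3T tokens": (65e9, 1.3e12),
}

for name, (params, tokens) in runs.items():
    ratio = tokens_per_param(params, tokens)
    flops = approx_train_flops(params, tokens)
    print(
        f"{name}: {ratio:.0f} tokens/param, "
        f"{ratio / CHINCHILLA_TOKENS_PER_PARAM:.1f}x the optimal ratio, "
        f"~{flops:.2e} FLOPs"
    )
```

Under the 6 * N * D approximation the two runs come out at roughly 2.0e23 and 5.1e23 FLOPs, so the budgets in the question are comparable rather than identical. The tokens-per-parameter gap (2 versus 20) is what drives the answer.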
A team is choosing between RLHF and DPO to align a fine-tuned model. They have a dataset of 50K human preference pairs (response A preferred over response B) but limited engineering resources. Which approach is more practical and why?
RLHF, because it always produces better-aligned models than DPO regardless of resources
Neither — with only 50K preference pairs, they should use SFT instead
DPO, because it optimizes directly on preference pairs without requiring a separate reward model or PPO training loop, making it simpler to implement and more stable
RLHF, because DPO requires at least 500K preference pairs to work
DPO directly optimizes the policy on preference data without the two-stage process of training a reward model and then running PPO. This makes it significantly simpler to implement, more stable during training, and less demanding of engineering resources. RLHF can outperform DPO in some scenarios with a well-tuned reward model, but for a resource-constrained team with a standard preference dataset, DPO is the practical choice. 50K preference pairs is sufficient for DPO training.
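To make "optimizes directly on preference pairs" concrete, here is a minimal sketch of the DPO loss for a single pair. It assumes you already have the summed log-probability of each response under the policy and under a frozen reference model; the function name, argument names, and the beta value are illustrative, and a real training loop would compute this over batches with an autodiff framework.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, from summed sequence log-probs."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as softplus(-margin) for stability.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Policy identical to the reference: margin is 0, loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy has shifted probability toward the preferred response: lower loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

No reward model or sampling loop appears anywhere in the computation, which is the practical advantage the explanation describes.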
A model answers "What is the capital of Australia?" correctly when asked directly (zero-shot) but fails a multi-step problem: "If a train travels 60 km/h for 2.5 hours, then 80 km/h for 1.5 hours, what is the total distance?" Which prompting technique is most likely to fix the multi-step failure?
Few-shot prompting with examples of capital city questions
Reducing the temperature to 0 for deterministic output
Chain-of-thought prompting, instructing the model to compute each leg of the journey step by step before summing
Increasing the context window size to give the model more working memory
Chain-of-thought prompting addresses multi-step reasoning failures by having the model generate intermediate steps as tokens. Each step (60 * 2.5 = 150, then 80 * 1.5 = 120, then 150 + 120 = 270) becomes part of the context for the next step. Without CoT, the model must produce the final answer directly, with no intermediate results to condition on, which is unreliable for multi-step arithmetic. Few-shot examples of unrelated tasks do not help. Temperature affects randomness, not reasoning capability. Context window size is not the bottleneck: the model needs to show its work, not have more space.
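As an illustration, the sketch below contrasts a direct prompt with a chain-of-thought prompt for the train question and computes the intermediate results a correct chain should contain. The prompt wording is hypothetical; only the numbers come from the question.

```python
# Hypothetical prompt wording; only the arithmetic comes from the question.
QUESTION = (
    "If a train travels 60 km/h for 2.5 hours, then 80 km/h for "
    "1.5 hours, what is the total distance?"
)

direct_prompt = f"{QUESTION}\nAnswer with a single number."
cot_prompt = (
    f"{QUESTION}\n"
    "Work step by step: compute the distance of each leg, "
    "then add the legs together and state the total."
)

# The intermediate steps a correct chain of thought should contain.
legs = [(60, 2.5), (80, 1.5)]  # (speed in km/h, duration in hours)
leg_distances = [speed * hours for speed, hours in legs]
total_km = sum(leg_distances)

for (speed, hours), distance in zip(legs, leg_distances):
    print(f"{speed} km/h * {hours} h = {distance} km")
print(f"total = {total_km} km")
```

The printed lines mirror the explanation: each leg's distance is written out before it is needed for the sum.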
A RAG system retrieves chunks using semantic search, but users report that searches for specific error codes (e.g., "ERR-4092") return irrelevant results about error handling in general. What architectural change would most directly fix this problem?
Switch to larger chunks to provide more context around each error code
Fine-tune the embedding model on error code documentation
Increase the number of retrieved chunks from 5 to 20
Add hybrid search combining BM25 keyword matching with semantic search, so exact-match queries like error codes use keyword retrieval
Semantic search ranks chunks by embedding similarity, which captures meaning rather than exact strings. To an embedding model, "ERR-4092" sits close to general "error handling" text even though the user wants one specific code. BM25 keyword search matches the exact token "ERR-4092" and will find documents containing that code. Hybrid search combines both methods: keyword search handles exact-match queries while semantic search handles conceptual queries. Larger chunks, more chunks, or fine-tuning the embedding model do not directly address the fundamental mismatch between semantic similarity and exact string matching.
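One common way to combine the two retrievers is reciprocal rank fusion (RRF). The sketch below is a toy under stated assumptions: the three documents are made up, the semantic ranking is hard-coded to mimic the failure described in the question instead of coming from a real embedding model, and the BM25 scorer is a minimal hand-written version, not a library call.

```python
import math
import re
from collections import Counter

# Toy corpus; the document texts are made up for illustration.
DOCS = {
    "d1": "general guide to error handling and retry logic",
    "d2": "ERR-4092 is returned when the upload token has expired",
    "d3": "logging errors and exceptions in production services",
}


def tokenize(text):
    # Keep hyphenated identifiers such as "err-4092" as single tokens.
    return re.findall(r"[a-z0-9-]+", text.lower())


def bm25_ranking(query, docs, k1=1.5, b=0.75):
    """Return ids of docs with a non-zero BM25 score, best first."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            df = sum(1 for other in tokenized.values() if term in other)
            if df == 0:
                continue
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists; each list adds 1 / (k + rank) per document."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]


query = "ERR-4092"
semantic_ranking = ["d1", "d3", "d2"]  # stand-in for an embedding search result
keyword_ranking = bm25_ranking(query, DOCS)
hybrid_ranking = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(hybrid_ranking)  # ['d2', 'd1', 'd3']
```

Only documents that actually contain a query term enter the keyword ranking, so the exact match on "ERR-4092" lifts d2 to the top even though the stand-in semantic ranking placed it last.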
A coding agent must fix a failing test. It reads the error, edits the code, runs the test, sees a new error, edits again, and runs the test again, succeeding on the second iteration. Each step in the agent loop succeeds independently with 90% probability. If the task requires 6 steps, what is the approximate probability the agent completes the task without any step failing?
90% — the probability of any single step
54% — because 0.9 to the power of 6 is approximately 0.53
53% — the overall reliability is 0.9^6, demonstrating how per-step reliability compounds
15% — because each failure compounds exponentially
Agent reliability is the product of per-step reliability. With 6 independent steps each succeeding at 90%, the overall probability is 0.9^6 ≈ 0.531, approximately 53%. This demonstrates why agent reliability is a critical research area: even a high per-step success rate produces surprisingly low end-to-end reliability for multi-step tasks. Improving per-step reliability from 90% to 95% would raise the 6-step success rate from 53% to 74% (0.95^6 ≈ 0.735).
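The compounding effect is easy to check numerically. This short sketch assumes, as the explanation does, that steps succeed or fail independently; the function name is illustrative.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


for p in (0.90, 0.95, 0.99):
    print(f"per-step {p:.0%}: 6-step success {end_to_end_success(p, 6):.1%}")
```

For the two rates discussed above this prints 53.1% and 73.5%. The 99% row is an extra data point showing how much per-step reliability is needed before a 6-step task succeeds more than nine times in ten.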