LLMs Check

A lab must choose between training a 130B-parameter model on 260B tokens or a 65B-parameter model on 1.3T tokens. According to the Chinchilla scaling findings, which approach produces a better model and why?
A. The 130B model, because larger models always outperform smaller ones at any training duration
B. The 65B model on 1.3T tokens, because Chinchilla showed compute-optimal training allocates roughly 20 tokens per parameter, and the 130B model is severely under-trained
C. Both produce equivalent models because total FLOPs are the same
D. The 130B model, because emergent abilities only appear above 100B parameters

The Chinchilla paper showed that, for a fixed compute budget, the lowest loss comes from scaling model size and training data together, at roughly 20 tokens per parameter. The 130B model trained on 260B tokens sees only 2 tokens per parameter, a tenth of that ratio, so it is severely under-trained. The 65B model trained on 1.3T tokens (20 tokens per parameter) matches the Chinchilla optimum and is expected to reach lower loss despite having half the parameters. The two runs are also not equal in compute: by the common estimate C ≈ 6ND, the 65B run costs about 2.5 times the FLOPs of the 130B run, so the "same FLOPs" option fails on its premise.
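The tokens-per-parameter comparison can be checked with a few lines of Python. This is a minimal sketch: the 20 tokens-per-parameter target is the approximate ratio quoted above, and the FLOP estimate uses the common C ≈ 6ND rule of thumb, which is an assumption brought in for illustration rather than something stated in the question.

```python
# Rule-of-thumb constants; both are approximations used for illustration.
CHINCHILLA_TOKENS_PER_PARAM = 20   # approximate compute-optimal ratio
FLOPS_PER_PARAM_TOKEN = 6          # assumed C ≈ 6 * N * D training-cost estimate


def tokens_per_param(params, tokens):
    """Training tokens seen per model parameter."""
    return tokens / params


def training_flops(params, tokens):
    """Rough training cost in FLOPs under the 6ND approximation."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


CONFIGS = {
    "130B on 260B tokens": (130e9, 260e9),
    "65B on 1.3T tokens": (65e9, 1.3e12),
}

for name, (n, d) in CONFIGS.items():
    ratio = tokens_per_param(n, d)
    share = ratio / CHINCHILLA_TOKENS_PER_PARAM
    print(f"{name}: {ratio:.0f} tokens/param "
          f"({share:.0%} of the ~20 target), "
          f"~{training_flops(n, d):.2e} training FLOPs")
```

The first configuration lands at a tenth of the target ratio and the second lands on it; the FLOP column is only as good as the 6ND approximation.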

A team is choosing between RLHF and DPO to align a fine-tuned model. They have a dataset of 50K human preference pairs (response A preferred over response B) but limited engineering resources. Which approach is more practical and why?
A. RLHF, because it always produces better-aligned models than DPO regardless of resources
B. Neither: with only 50K preference pairs, they should use SFT instead
C. DPO, because it optimizes directly on preference pairs without requiring a separate reward model or PPO training loop, making it simpler to implement and more stable
D. RLHF, because DPO requires at least 500K preference pairs to work

DPO optimizes the policy directly on preference data, skipping the two-stage process of training a reward model and then running PPO. This makes it significantly simpler to implement, typically more stable during training, and less demanding of engineering resources. RLHF can outperform DPO in some scenarios with a well-tuned reward model, but for a resource-constrained team with a standard preference dataset, DPO is the practical choice. A dataset of 50K preference pairs is generally enough for DPO training.
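To make "optimizes directly on preference pairs" concrete, here is a minimal sketch of the per-pair DPO loss in plain Python. It assumes the summed log-probabilities of each response under the policy and under a frozen reference model are already available. The function name, argument names, and the beta value of 0.1 are illustrative choices, not taken from the text.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each argument is the summed log-probability of a full response.
    beta=0.1 is an assumed value for illustration.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin); computed in a stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Made-up log-probabilities, for illustration only.
# A policy identical to the reference has zero margin, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# A policy that raises the chosen response and lowers the rejected one scores lower.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

No reward model or sampling loop appears anywhere in the computation: the preference pair and two sets of log-probabilities are all the loss needs, which is where the engineering savings come from.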

A model answers "What is the capital of Australia?" correctly when asked directly (zero-shot) but fails a multi-step problem: "If a train travels 60 km/h for 2.5 hours, then 80 km/h for 1.5 hours, what is the total distance?" Which prompting technique is most likely to fix the multi-step failure?
A. Few-shot prompting with examples of capital city questions
B. Reducing the temperature to 0 for deterministic output
C. Chain-of-thought prompting, instructing the model to compute each leg of the journey step by step before summing
D. Increasing the context window size to give the model more working memory

Chain-of-thought prompting addresses multi-step reasoning failures by having the model generate intermediate steps as tokens. Each step (60 × 2.5 = 150, then 80 × 1.5 = 120, then 150 + 120 = 270) becomes part of the context for the next step. Without CoT, the model must commit to a final answer without writing out any intermediate results, which is unreliable for multi-step arithmetic. Few-shot examples of unrelated tasks do not help. Temperature affects randomness, not reasoning capability. Context window size is not the bottleneck: the model needs to show its work, not have more space.
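A small sketch of what this looks like in practice: a hypothetical prompt builder that appends a step-by-step instruction, plus the arithmetic a correct chain of thought should reproduce. The instruction wording and the function names are assumptions for illustration; no model API is called.

```python
QUESTION = ("If a train travels 60 km/h for 2.5 hours, then 80 km/h for "
            "1.5 hours, what is the total distance?")


def build_cot_prompt(question):
    """Append a step-by-step instruction (hypothetical wording) to the question."""
    return (f"{question}\n"
            "Work step by step: compute the distance for each leg, "
            "then add the legs and state the total in km.")


def leg_distances(legs):
    """Distance of each leg, given (speed in km/h, duration in hours) pairs."""
    return [speed * hours for speed, hours in legs]


LEGS = [(60, 2.5), (80, 1.5)]
steps = leg_distances(LEGS)   # the intermediate results the model should write out
total = sum(steps)            # the final answer built from those results

print(build_cot_prompt(QUESTION))
print("Expected intermediate steps:", steps, "-> total:", total)
```

The reference values are useful for checking a model's written steps one at a time, rather than only grading the final number.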

A RAG system retrieves chunks using semantic search, but users report that searches for specific error codes (e.g., "ERR-4092") return irrelevant results about error handling in general. What architectural change would most directly fix this problem?
A. Switch to larger chunks to provide more context around each error code
B. Fine-tune the embedding model on error code documentation
C. Increase the number of retrieved chunks from 5 to 20
D. Add hybrid search combining BM25 keyword matching with semantic search, so exact-match queries like error codes use keyword retrieval

Semantic search ranks chunks by embedding similarity, so a query for "ERR-4092" lands near general "error handling" content even though the user wants one specific code. BM25 keyword search matches the exact string "ERR-4092" and will find documents containing that specific code. Hybrid search combines both methods: keyword search handles exact-match queries while semantic search handles conceptual queries. Larger chunks, more chunks, or fine-tuning the embedding model do not directly address the fundamental mismatch between semantic similarity and exact string matching.
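The sketch below shows one way to combine the two retrievers. It implements BM25 from scratch over a toy corpus and merges its ranking with a semantic ranking using reciprocal rank fusion (RRF). RRF is one common fusion choice, assumed here because the text does not say how results are combined. The documents, the hard-coded semantic ranking, and all function names are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
DOCS = [
    "general guide to error handling and retry strategies",
    "ERR-4092 means the upload exceeded the size limit",
    "best practices for logging errors and exceptions",
]


def tokenize(text):
    # Whitespace split keeps identifiers such as "err-4092" as a single token.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """BM25 score of every document for the query (k1 and b are usual defaults)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5)
                           / (doc_freq[term] + 0.5) + 1)
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def keyword_ranking(query, docs):
    """Indices of documents with a nonzero BM25 score, best first."""
    scores = bm25_scores(query, docs)
    hits = [i for i, s in enumerate(scores) if s > 0]
    return sorted(hits, key=lambda i: scores[i], reverse=True)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse rankings: each list adds 1 / (k + rank) to a document's score."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical semantic ranking that mimics the failure described above:
# general error-handling documents rank ahead of the exact code.
semantic_ranking = [0, 2, 1]

keyword = keyword_ranking("ERR-4092", DOCS)
hybrid = reciprocal_rank_fusion([keyword, semantic_ranking])
print("keyword:", keyword, "semantic:", semantic_ranking, "hybrid:", hybrid)
```

Because only the document containing the literal code earns a keyword score, it collects votes from both retrievers and moves to the top of the fused list. A production system would normally use an existing BM25 index and a real embedding model in place of these stand-ins.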

A coding agent must fix a failing test. It reads the error, edits the code, runs the test, sees a new error, edits again, and runs the test again, succeeding on the second iteration. Each step in the agent loop succeeds with 90% probability. If the task requires 6 steps, what is the approximate probability the agent completes the task without any step failing?
A. 90%, the probability of any single step
B. 54%, because 0.9 to the power of 6 is approximately 0.53
C. 53%, because the overall reliability is 0.9^6, demonstrating how per-step reliability compounds
D. 15%, because each failure compounds exponentially

Agent reliability is the product of per-step reliability. With 6 independent steps each succeeding at 90%, the overall probability is 0.9^6 = 0.531, approximately 53%. This demonstrates why agent reliability is a critical research area: even a high per-step success rate produces surprisingly low end-to-end reliability for multi-step tasks. Improving per-step reliability from 90% to 95% would raise the 6-step success rate from 53% to 74% (0.95^6 = 0.735).
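The compounding is easy to reproduce. This minimal sketch assumes, as the explanation does, that steps fail independently; the helper names are illustrative.

```python
def end_to_end_success(per_step, steps):
    """Probability that all steps succeed, assuming independent steps."""
    return per_step ** steps


def required_per_step(target, steps):
    """Per-step success rate needed to reach a target end-to-end rate."""
    return target ** (1.0 / steps)


for p in (0.90, 0.95):
    print(f"per-step {p:.0%} over 6 steps -> {end_to_end_success(p, 6):.1%}")

# Inverse question: what per-step rate gives 90% end-to-end over 6 steps?
print(f"needed per step: {required_per_step(0.90, 6):.1%}")
```

Running the inverse calculation shows the other side of the same effect: reaching 90% end-to-end over 6 steps needs roughly 98% reliability on every individual step.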