AI Foundations Check

Covers: ml-landscape, neural-networks, deep-learning, attention-mechanism, transformer-architecture, tokenization-embeddings

A team trains a model to classify product reviews as positive or negative using 10,000 labeled reviews. They also have 500,000 unlabeled reviews. The labeled-only model achieves 87% accuracy. Which approach best leverages the unlabeled data?
Self-supervised pre-training (predicting masked words or next tokens) learns useful language representations from unlabeled text. Fine-tuning then adapts these representations to the classification task using the labeled data. This is the approach that made BERT and GPT effective — pre-train on vast unlabeled data, fine-tune on task-specific labeled data. Reinforcement learning requires a reward signal the unlabeled data does not provide. Clustering produces groupings, not sentiment labels. Averaging separate models does not transfer knowledge between datasets.
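To make the "labels come for free" point concrete, here is a minimal sketch of how masked-word training pairs can be derived from unlabeled text. The function name `masked_examples`, the `[MASK]` placeholder string, and the whitespace tokenization are illustrative assumptions, not part of any specific library or of how BERT actually samples masks.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Turn one unlabeled sentence into (masked_input, target_word) pairs.

    Hypothetical helper: real pre-training masks a random subset of subword
    tokens; this sketch masks each whitespace-separated word once.
    """
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples


# An unlabeled review yields supervised-looking pairs with no human labeling.
pairs = masked_examples("the battery lasts all day")
for masked_input, target in pairs:
    print(f"{masked_input!r} -> {target!r}")
```

Every one of the 500,000 unlabeled reviews can produce training pairs this way, which is why pre-training can use them while the 10,000 labeled reviews are reserved for fine-tuning.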
During training, a neural network's loss decreases on the training set for 50 epochs, but after epoch 20, the validation loss starts increasing while training loss continues to fall. A junior engineer proposes increasing the number of layers to improve performance. What is the actual problem, and why would the proposed solution make it worse?
Training loss decreasing while validation loss increases is the classic signature of overfitting: the model is learning patterns specific to the training data that do not generalize. Adding more layers increases the model's capacity to memorize, making overfitting worse. The correct interventions are regularization (dropout, weight decay), early stopping (stop at epoch 20), more training data, or reducing model complexity. Vanishing gradients would cause training loss to plateau, not continue decreasing. An oscillating learning rate would show erratic loss, not smooth divergence.
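Early stopping can be sketched as a small function over a validation-loss history. This is a simplified illustration: the `patience` parameter and the synthetic loss curve below are assumptions chosen to mimic the scenario in the question, not measurements from a real run.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return (best_epoch, stop_epoch), both 1-indexed.

    best_epoch: epoch with the lowest validation loss seen so far.
    stop_epoch: epoch at which training halts because validation loss
                has not improved for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses)


# Synthetic curve: validation loss falls until epoch 20, then rises.
val_losses = [1.0 - 0.03 * e for e in range(1, 21)]
val_losses += [0.4 + 0.01 * (e - 20) for e in range(21, 51)]

best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=3)
print(best_epoch, stop_epoch)
```

In practice the weights saved at `best_epoch` are the ones restored, so the extra `patience` epochs cost compute but not generalization.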
RNNs were the dominant sequence model before transformers. Which two limitations of RNNs did the attention mechanism directly solve?
RNNs process tokens sequentially: each step depends on the previous one, so computation cannot be parallelized across timesteps and training is slow. Information from early positions must propagate through every intermediate step to reach later positions, and the signal degrades along the way (vanishing gradients). Self-attention addresses both: all positions are processed in parallel, and every position attends directly to every other in a single step (path length 1 regardless of distance). The other candidate answers do not describe RNN limitations: RNNs can handle variable-length inputs, can be trained with gradient descent, and can use unsupervised objectives.
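The "path length 1" property can be seen in a minimal scaled dot-product self-attention sketch. As a simplifying assumption, queries, keys, and values are all the raw input vectors (no learned projections, single head), and the code uses plain Python lists rather than a tensor library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with Q = K = V = x (illustrative only).

    Returns (weights, outputs). Each row of `weights` is computed from the
    inputs alone, never from another row's output, which is why all
    positions can be computed in parallel on real hardware.
    """
    d = len(x[0])
    weights, outputs = [], []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w[j] * x[j][c] for j in range(len(x))) for c in range(d)])
    return weights, outputs


# Four positions; the first and last hold the same vector.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
weights, outputs = self_attention(x)
# Position 0 weighs position 3 directly, with no intermediate steps.
print(weights[0])
```

An RNN would need three recurrent steps to carry information from position 0 to position 3; here the weight `weights[0][3]` connects them in one operation.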
A transformer model with 70 billion parameters is trained on 2 trillion tokens. A research team plans to train a new model and must decide between 140 billion parameters on 2 trillion tokens or 70 billion parameters on 4 trillion tokens, with the same total compute budget. Based on established scaling laws, which choice is better and why?
The Chinchilla paper (Hoffmann et al., 2022) demonstrated that for a fixed compute budget, models should be trained on roughly 20 tokens per parameter, so a 70B model should see roughly 1.4 trillion tokens. The original 70B/2T model, at about 29 tokens per parameter, is already slightly past that ratio. Doubling parameters to 140B without more data drops the ratio to about 14 tokens per parameter and creates an undertrained model: it has capacity it cannot effectively use. Doubling data to 4T for the 70B model provides more learning signal, although at about 57 tokens per parameter it sits well above the 20:1 guideline. Scaling laws show that parameters and data must scale together; overinvesting in one dimension gives diminishing returns.
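The arithmetic behind this comparison can be checked with a short sketch. The 20 tokens-per-parameter figure is the rule of thumb quoted above; the compute estimate C ≈ 6 · N · D is a commonly used approximation that is assumed here, not something stated in the explanation.

```python
TOKENS_PER_PARAM_GUIDELINE = 20  # rule of thumb quoted in the explanation


def training_flops(params, tokens):
    """Approximate training compute via C ~= 6 * N * D (assumed approximation)."""
    return 6 * params * tokens


def tokens_per_param(params, tokens):
    """Ratio of training tokens to model parameters."""
    return tokens / params


B = 10**9   # one billion
T = 10**12  # one trillion

configs = {
    "70B / 2T (original)": (70 * B, 2 * T),
    "140B / 2T": (140 * B, 2 * T),
    "70B / 4T": (70 * B, 4 * T),
}

for name, (n, d) in configs.items():
    ratio = tokens_per_param(n, d)
    side = "below" if ratio < TOKENS_PER_PARAM_GUIDELINE else "above"
    print(f"{name}: {ratio:.1f} tokens/param ({side} guideline), "
          f"{training_flops(n, d):.2e} FLOPs")
```

Under this approximation the two candidate configurations cost the same compute, so the choice comes down to which side of the guideline each one lands on.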
A developer observes that their LLM correctly answers "What is 12 + 7?" (single token: "19") but fails on "What is 1247 + 893?" (multi-token answer). The model has been trained on millions of arithmetic examples. What is the most likely explanation?
BPE tokenizers split numbers into subword tokens based on frequency in training text, not mathematical structure. "1247" might become ["12", "47"] or ["1", "247"] depending on the tokenizer. The model must then reason about digit positions across token boundaries (carrying digits, aligning place values) using token representations that do not directly encode place value. Small numbers that are single tokens can be answered from memorized associations, much like a lookup table. Multi-token numbers require genuine cross-token arithmetic reasoning, which is much harder. This is largely a tokenization artifact rather than a pure reasoning limitation: models with digit-level tokenization or specialized numerical representations perform better on arithmetic.
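The splitting behavior can be illustrated with a toy tokenizer. This sketch uses greedy longest-match segmentation over a hand-picked vocabulary, which is a simplification: real BPE applies learned merge rules, and `TOY_VOCAB` is a made-up vocabulary chosen only to reproduce the kind of split described above.

```python
def greedy_tokenize(text, vocab):
    """Segment `text` by repeatedly taking the longest prefix found in `vocab`.

    Simplified stand-in for a BPE tokenizer (illustrative assumption).
    """
    tokens = []
    i = 0
    while i < len(text):
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


# Hypothetical vocabulary: all single digits plus a few "frequent" chunks.
TOY_VOCAB = set("0123456789") | {"12", "19", "47", "89", "214"}

for number in ["19", "1247", "893", "2140"]:
    print(number, "->", greedy_tokenize(number, TOY_VOCAB))
```

With this vocabulary "19" stays whole, while "1247", "893", and the sum "2140" are each cut at positions unrelated to place value, so the ones digit of one operand and the ones digit of the other end up in tokens of different shapes.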