AI Foundations Check

Covers: ml-landscape, neural-networks, deep-learning, attention-mechanism, transformer-architecture, tokenization-embeddings

A team trains a model to classify product reviews as positive or negative using 10,000 labeled reviews. They also have 500,000 unlabeled reviews. The labeled-only model achieves 87% accuracy. Which approach best leverages the unlabeled data?
- Reinforcement learning: reward the model for correct predictions on unlabeled data
- Unsupervised learning: cluster the unlabeled reviews and use cluster assignments as labels
- Self-supervised pre-training on the unlabeled data to learn language representations, then fine-tuning on the labeled data for classification
- Train separate models on each dataset and average their outputs

Self-supervised pre-training (predicting masked words or next tokens) learns useful language representations from unlabeled text. Fine-tuning then adapts these representations to the classification task using the labeled data. This is the approach that made BERT and GPT effective — pre-train on vast unlabeled data, fine-tune on task-specific labeled data. Reinforcement learning requires a reward signal the unlabeled data does not provide. Clustering produces groupings, not sentiment labels. Averaging separate models does not transfer knowledge between datasets.
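To make "predicting masked words" concrete, here is a minimal sketch of how an unlabeled review can be turned into a training pair without any human labels. The `MASK` placeholder, the function name, and the masking probability are illustrative assumptions, not the exact recipe used by BERT or any particular model.

```python
import random

MASK = "[MASK]"  # placeholder token; the exact string is an assumption


def make_masked_example(tokens, mask_prob=0.15, rng=None):
    """Turn an unlabeled token list into an (inputs, targets) pair.

    targets[i] holds the original token wherever inputs[i] was masked,
    and None everywhere else, so the text supervises itself.
    """
    if rng is None:
        rng = random.Random(0)
    inputs, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # hide the token from the model
            targets.append(tok)   # ...and ask the model to recover it
        else:
            inputs.append(tok)
            targets.append(None)  # no loss is computed at this position
    return inputs, targets


review = "the battery lasts all day and charges fast".split()
inputs, targets = make_masked_example(review, mask_prob=0.3, rng=random.Random(42))
print(inputs)
print(targets)
```

Every one of the 500,000 unlabeled reviews can produce pairs like this, which is why pre-training can use data that the supervised classifier cannot.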

During training, a neural network's loss decreases on the training set for 50 epochs, but after epoch 20, the validation loss starts increasing while training loss continues to fall. A junior engineer proposes increasing the number of layers to improve performance. What is the actual problem, and why would the proposed solution make it worse?
- Underfitting: the model needs more capacity, so adding layers would actually help
- The learning rate is too high: the model is oscillating past the optimum
- Overfitting: the model is memorizing training data; adding layers increases capacity and would worsen the memorization
- Vanishing gradients: the loss function gradient is too small to update early layers

Training loss decreasing while validation loss increases is the classic signature of overfitting: the model is learning patterns specific to the training data that do not generalize. Adding more layers increases the model's capacity to memorize, making overfitting worse. The correct interventions are regularization (dropout, weight decay), early stopping (keep the checkpoint from around epoch 20, where validation loss bottomed out), more training data, or reducing model complexity. Vanishing gradients would cause training loss to plateau, not continue decreasing. A learning rate that is too high would show erratic, oscillating loss, not a smooth divergence between the training and validation curves.
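Early stopping can be sketched in a few lines. The function name, the `patience` parameter, and the synthetic validation curve below are illustrative assumptions chosen to mirror the scenario in the question (validation loss bottoms out at epoch 20).

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the 1-based epoch with the best validation loss, stopping the
    scan once `patience` consecutive epochs pass without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss has stopped improving
    return best_epoch


# Synthetic curve: validation loss falls through epoch 20, then rises.
val_losses = [1.0 - 0.03 * e for e in range(1, 21)] + [0.4 + 0.01 * e for e in range(1, 31)]
print(early_stopping_epoch(val_losses))
```

In a real training loop the same check would run after each epoch and the weights from the best epoch would be saved and restored.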

RNNs were the dominant sequence model before transformers. Which two limitations of RNNs did the attention mechanism directly solve?
- RNNs could not handle variable-length inputs and required fixed-size vocabularies
- RNNs could not be trained with gradient descent and required evolutionary optimization
- Sequential processing prevented parallelization, and information between distant positions degraded through many intermediate steps
- RNNs could not learn from unlabeled data and required all training examples to be labeled

RNNs process tokens sequentially — each step depends on the previous — making parallelization across timesteps impossible and training slow. Information from early positions must propagate through every intermediate step to reach later positions, suffering gradient degradation (vanishing gradients) along the way. Self-attention solves both: all positions are processed in parallel, and every position directly attends to every other in a single step (path length 1 regardless of distance). RNNs can handle variable-length inputs, can be trained with gradient descent, and can use unsupervised objectives.
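The contrast can be illustrated with a small pure-Python sketch. It assumes a single attention head with identity projections (Q = K = V = the input), which real transformers replace with learned matrices. The `toy_recurrence` function is a deliberately simplified stand-in for an RNN, not a trained model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Single-head self-attention with identity projections (Q = K = V = x).

    Returns (outputs, weights); weights[i][j] is how much position i
    attends to position j.
    """
    dim = len(x[0])
    outputs, weights = [], []
    # Each query is handled independently of the others, so this loop could
    # run in parallel (in practice it is one batched matrix multiply).
    for query in x:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) for key in x]
        row = softmax(scores)
        weights.append(row)
        outputs.append([sum(w * value[c] for w, value in zip(row, x)) for c in range(dim)])
    return outputs, weights


def toy_recurrence(inputs, decay=0.5):
    """Toy scalar recurrence h_t = decay * h_{t-1} + x_t (not a trained RNN)."""
    h, states = 0.0, []
    for x_t in inputs:  # strictly sequential: step t needs step t-1
        h = decay * h + x_t
        states.append(h)
    return states


sequence = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(sequence)
print(weights[3])                            # last position's attention over all positions
print(toy_recurrence([1.0, 0.0, 0.0, 0.0]))  # first input's trace fades step by step
```

In the attention weights, the last position attends to the first position exactly as strongly as to itself, because their vectors match and distance plays no role. In the toy recurrence, the first input's contribution shrinks at every intermediate step.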

A transformer model with 70 billion parameters is trained on 2 trillion tokens. A research team plans to train a new model and must decide between 140 billion parameters on 2 trillion tokens or 70 billion parameters on 4 trillion tokens, with the same total compute budget. Based on established scaling laws, which choice is better and why?
- Double the parameters: larger models are always more capable regardless of training data
- Neither: both choices produce identical performance because the compute budget is the same
- Double the parameters: the scaling hypothesis shows parameters matter more than data
- Double the data: the Chinchilla scaling laws showed that most large models are undertrained relative to their size, and compute is better spent on more data than more parameters

The Chinchilla paper (Hoffmann et al., 2022) demonstrated that for a fixed compute budget, parameters and training tokens should be scaled together, at roughly 20 tokens per parameter. By that rule of thumb a 70B model should see roughly 1.4 trillion tokens, so the original 70B/2T model (about 29 tokens per parameter) is near optimal. Doubling parameters to 140B without more data drops the ratio to about 14 tokens per parameter and creates an undertrained model: it has capacity it cannot effectively use. Doubling data to 4T for the 70B model (about 57 tokens per parameter) provides more learning signal. The 20:1 figure is a rule of thumb rather than a sharp optimum, but the underlying point holds: parameters and data must scale together, and overinvesting in one dimension gives diminishing returns.
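The arithmetic behind this comparison can be checked with a short sketch. It relies on two assumptions that are not part of the question: the common approximation that training compute is about 6 × parameters × tokens, and the parametric loss fit L(N, D) = E + A/N^α + B/D^β with the constants reported by Hoffmann et al. (2022). The fitted loss is illustrative only. It is not a prediction for any real model.

```python
BILLION = 10**9
TRILLION = 10**12


def training_flops(params, tokens):
    """Approximate training compute with the common C ~ 6 * N * D rule."""
    return 6 * params * tokens


def parametric_loss(params, tokens, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """L(N, D) = E + A / N**alpha + B / D**beta, using the fitted constants
    reported by Hoffmann et al. (2022). Treat the result as illustrative."""
    return E + A / params**alpha + B / tokens**beta


options = {
    "original 70B / 2T": (70 * BILLION, 2 * TRILLION),
    "140B / 2T": (140 * BILLION, 2 * TRILLION),
    "70B / 4T": (70 * BILLION, 4 * TRILLION),
}

for name, (n, d) in options.items():
    print(
        f"{name}: {d / n:.1f} tokens/param, "
        f"{training_flops(n, d):.2e} FLOPs, "
        f"fitted loss {parametric_loss(n, d):.3f}"
    )
```

Under these assumptions the two candidate runs cost the same compute, and the 70B/4T option comes out with a slightly lower fitted loss than 140B/2T. The gap is small, so the result is best read as supporting the direction of the answer, not as a precise estimate.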

A developer observes that their LLM correctly answers "What is 12 + 7?" (single token: "19") but fails on "What is 1247 + 893?" (multi-token answer). The model has been trained on millions of arithmetic examples. What is the most likely explanation?
- The model has insufficient parameters to learn arithmetic
- Tokenization splits large numbers into subword tokens, requiring the model to perform cross-token reasoning about digit positions rather than operating on whole numbers
- The training data did not include enough large-number examples
- The attention mechanism cannot process numerical values, only text

BPE tokenizers split numbers into subword tokens based on frequency in training text, not mathematical structure. "1247" might become ["12", "47"] or ["1", "247"] depending on the tokenizer. The model must then reason about digit positions across token boundaries (carrying digits, aligning place values) using representations that encode no numerical semantics. Small numbers that are single tokens can be memorized as lookup tables. Multi-token numbers require genuine cross-token arithmetic reasoning, which is much harder. This is largely a tokenization artifact, not a limit on reasoning as such: models with digit-level tokenization or specialized numerical representations perform better on arithmetic.
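The misalignment is easy to see with a toy tokenizer. The sketch below uses greedy longest-match over a small hypothetical vocabulary. This is a simplification, not real BPE, and `TOY_VOCAB` is invented for illustration and does not come from any actual model.

```python
def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization; unknown pieces fall back to
    single characters. A simplified stand-in for a subword tokenizer."""
    tokens, i = [], 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary: a handful of frequent two-digit chunks.
TOY_VOCAB = {"12", "19", "40", "47", "89"}

for number in ["19", "1247", "893", "2140"]:
    print(number, greedy_tokenize(number, TOY_VOCAB), list(number))
```

With this vocabulary "1247" splits after the hundreds place while "893" splits after the tens place, so the units digit of one operand sits inside a two-digit token and the other stands alone. Digit-level tokenization (`list(number)`) keeps one token per place value, which is one reason it tends to help on arithmetic.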