Hash Functions

Fingerprints for data. A cryptographic hash function takes an arbitrary amount of input and produces a fixed-size output that is, for all practical purposes, unique to that input.

Mental Model: The Fingerprint Machine

Feed in any amount of data — a single byte, a 10 GB file, the entire contents of a database — and the machine produces a fixed-size fingerprint. 256 bits for SHA-256. Always the same size, regardless of input.
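The fixed-size property is easy to observe with Python's standard hashlib module. This is a minimal sketch; the helper name sha256_hex and the sample inputs are illustrative choices, not part of any standard.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()

# Inputs of very different sizes: empty, one byte, one million bytes.
inputs = [b"", b"a", b"a" * 1_000_000]

for data in inputs:
    digest = sha256_hex(data)
    # Each hex character encodes 4 bits, so 64 characters = 256 bits.
    print(f"{len(data):>9} input bytes -> {len(digest) * 4} output bits")
```

Whatever the input length, the digest is 64 hex characters long.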

Same input always produces the same fingerprint. Different inputs produce different fingerprints — not with absolute certainty (the pigeonhole principle guarantees collisions exist), but with overwhelming probability. Finding two inputs that produce the same hash should be computationally infeasible.

The Avalanche Effect

Change a single bit of the input, and the output changes completely — roughly half the bits flip. There is no observable relationship between similar inputs and their hashes. This property makes hash functions useful for integrity checking: any modification, no matter how small, produces an entirely different hash.
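The avalanche effect can be measured directly. The sketch below flips one input bit and counts how many of the 256 output bits change. The message and the helper bit_difference are made up for illustration.

```python
import hashlib

def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions where two equal-length byte strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

original = bytearray(b"The quick brown fox jumps over the lazy dog")
modified = bytearray(original)
modified[0] ^= 0x01  # flip the lowest bit of the first byte

h1 = hashlib.sha256(original).digest()
h2 = hashlib.sha256(modified).digest()

flipped = bit_difference(h1, h2)
print(f"input bits changed:  {bit_difference(original, modified)}")
print(f"output bits changed: {flipped} of 256")
```

For a well-designed hash the second number should land near 128, with some variation from one input to the next.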

The Three Security Properties

A cryptographic hash function must satisfy three guarantees. Each gives the attacker more freedom than the one before, so each is harder for the function to uphold:

Preimage resistance: given a hash h, it is infeasible to find any input m such that hash(m) = h. The function cannot be reversed. This is the one-way property.

Second preimage resistance: given an input m1, it is infeasible to find a different input m2 such that hash(m1) = hash(m2). No substitute that hashes to the same value can be found.

Collision resistance: it is infeasible to find any two distinct inputs m1 and m2 such that hash(m1) = hash(m2). No pair of colliding inputs can be found.

Why Three Properties, Not One

These properties serve different threat models. Preimage resistance protects password hashes — an attacker who steals hash(password) cannot recover the password. Second preimage resistance protects document integrity — an attacker cannot swap a signed document for a forgery with the same hash. Collision resistance protects certificate authorities — an attacker cannot craft two certificates (one benign, one malicious) with the same hash and get the benign one signed.

The MD5 collision attack against certificate authorities in 2008 demonstrated why collision resistance matters: researchers created a rogue CA certificate that collided with a legitimate one, allowing them to forge certificates for any domain.

The Birthday Paradox

Collision resistance is harder to achieve than it appears. Due to the birthday paradox, finding a collision in an n-bit hash requires roughly 2^(n/2) operations, not 2^n. A 128-bit hash such as MD5 therefore offers at most about 2^64 collision resistance, even before any design flaws are considered, and 2^64 operations are within reach of modern hardware. This is why a 256-bit output, as in SHA-256 (about 2^128 collision resistance), is the minimum acceptable hash length for security-critical applications.
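The gap between 2^(n/2) and 2^n can be seen on a deliberately weakened hash. The sketch below truncates SHA-256 to 16 bits (an assumption made purely so the searches finish quickly) and compares a birthday-style collision search with a second preimage search. The names toy_hash, find_collision, and find_second_preimage are hypothetical.

```python
import hashlib
from itertools import count

BITS = 16  # toy output size so the search is fast; real hashes use 256

def toy_hash(data: bytes) -> int:
    """SHA-256 truncated to its top BITS bits (deliberately weak)."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full >> (256 - BITS)

def find_collision():
    """Birthday search: remember every hash seen until one repeats."""
    seen = {}
    for i in count():
        msg = str(i).encode()
        h = toy_hash(msg)
        if h in seen:
            return seen[h], msg, i + 1  # two messages and hashes computed
        seen[h] = msg

def find_second_preimage(target_msg: bytes):
    """Search for a different message that matches one fixed hash."""
    target = toy_hash(target_msg)
    for i in count():
        msg = str(i).encode()
        if msg != target_msg and toy_hash(msg) == target:
            return msg, i + 1

m1, m2, collision_tries = find_collision()
m3, second_preimage_tries = find_second_preimage(b"fixed message")

print(f"collision after {collision_tries} hashes "
      f"(2^(n/2) = {2 ** (BITS // 2)})")
print(f"second preimage after {second_preimage_tries} hashes "
      f"(2^n = {2 ** BITS})")
```

The exact counts depend on the inputs tried, but the collision search typically finishes after a few hundred hashes while the second preimage search needs tens of thousands.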

SHA-256 and the Hash Landscape

SHA-256: the current standard. 256-bit output. Used in Bitcoin proof-of-work, TLS certificate fingerprints, git (migrating from SHA-1), and virtually every system requiring cryptographic integrity. Part of the SHA-2 family designed by the NSA.

SHA-1: 160-bit output. Broken — Google/CWI produced the first collision (SHAttered) in 2017. Still used in legacy systems but should not be trusted for security.

MD5: 128-bit output. Thoroughly broken. Collisions can be generated in seconds on a laptop. Do not use for any security purpose.

SHA-3 (Keccak): selected by NIST in 2012 as the winner of the SHA-3 competition and standardized in 2015 (FIPS 202) as an alternative to SHA-2. Completely different internal design (sponge construction vs Merkle-Damgård). Less widely deployed than SHA-2 because SHA-2 remains unbroken, but it provides a fallback if SHA-2 weaknesses are ever discovered.
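The output sizes listed above can be checked against Python's hashlib, which ships all four functions. This is a small sketch assuming Python 3.9 or later (for the usedforsecurity flag). The generic collision bound it prints is the birthday bound only and says nothing about the practical breaks of MD5 and SHA-1.

```python
import hashlib

ALGORITHMS = ["md5", "sha1", "sha256", "sha3_256"]

def digest_bits(name: str) -> int:
    """Return the output size of the named hash function in bits."""
    # usedforsecurity=False (Python 3.9+) lets builds that restrict
    # MD5 and SHA-1 still construct them for this demonstration.
    return hashlib.new(name, usedforsecurity=False).digest_size * 8

sizes = {name: digest_bits(name) for name in ALGORITHMS}

for name, bits in sizes.items():
    print(f"{name:9s} {bits:3d}-bit output, "
          f"generic collision bound 2^{bits // 2}")
```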

BLAKE3: The Performance Frontier

BLAKE3 is a cryptographic hash function designed for speed. It is parallelizable by construction: the internal tree structure means hashing can scale across cores. In its authors' benchmarks BLAKE3 is faster than MD5, while its default 256-bit output targets a 128-bit security level, the same collision resistance as SHA-256. It is not a NIST standard, but it is increasingly adopted for performance-critical applications where SHA-256’s sequential design is a bottleneck.

The Software Perspective

git: every object (commit, tree, blob) is identified by the SHA-1 hash of a short type-and-length header followed by its content. This is content-addressable storage. Two identical files always produce the same hash. Git is migrating from SHA-1 to SHA-256 precisely because SHA-1 collision resistance is broken.
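As a concrete illustration, the ID git assigns to a blob in a SHA-1 repository is the SHA-1 of the header "blob <size>\0" followed by the file bytes. The function name git_blob_id below is a hypothetical helper. It reproduces what `git hash-object` reports for a file with the same content.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID git assigns to a blob with this content."""
    # Header: object type, a space, the decimal byte length, then a NUL byte.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Identical content always maps to the same object ID.
print(git_blob_id(b"hello world\n"))
print(git_blob_id(b"hello world\n") == git_blob_id(b"hello world\n"))
```

Because the ID depends only on the content, two files with the same bytes are stored once, no matter what they are named or where they live in the tree.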

Docker: image layer digests are SHA-256 hashes. When an image is pulled, the registry verifies each layer’s hash against the manifest. Tampered layers are detected and rejected.
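The check amounts to recomputing the hash of the downloaded bytes and comparing it with the digest recorded in the manifest. The sketch below is a simplified illustration, not Docker's actual client code. The function verify_layer and the sample blob are assumptions, and only the "sha256:<hex>" digest form is handled.

```python
import hashlib
import hmac

def verify_layer(blob: bytes, expected_digest: str) -> bool:
    """Check a downloaded blob against a digest of the form 'sha256:<hex>'."""
    algorithm, _, expected_hex = expected_digest.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported digest algorithm: {algorithm!r}")
    actual_hex = hashlib.sha256(blob).hexdigest()
    # compare_digest avoids leaking where the first mismatch occurs.
    return hmac.compare_digest(actual_hex, expected_hex.lower())

layer = b"pretend these are the bytes of a compressed layer"
manifest_digest = "sha256:" + hashlib.sha256(layer).hexdigest()

print(verify_layer(layer, manifest_digest))         # untouched layer
print(verify_layer(layer + b"!", manifest_digest))  # tampered layer
```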

Subresource Integrity (SRI): when a third-party script is included via CDN, SRI allows the expected SHA-256, SHA-384, or SHA-512 hash of the file to be specified in the tag's integrity attribute. If the CDN is compromised and serves a modified script, the browser rejects it because the hash doesn’t match.
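An SRI integrity value is the algorithm name, a hyphen, and the base64-encoded digest of the file. A minimal sketch of generating one follows. The helper name sri_value and the sample script are illustrative.

```python
import base64
import hashlib

SRI_ALGORITHMS = ("sha256", "sha384", "sha512")

def sri_value(script: bytes, algorithm: str = "sha384") -> str:
    """Build an SRI integrity string such as 'sha384-<base64 digest>'."""
    if algorithm not in SRI_ALGORITHMS:
        raise ValueError(f"SRI does not define {algorithm!r}")
    digest = hashlib.new(algorithm, script).digest()
    return f"{algorithm}-{base64.b64encode(digest).decode('ascii')}"

script = b"console.log('hello');\n"
print(sri_value(script))
```

The resulting string goes into the integrity attribute of the script tag, and the browser recomputes the same value over the bytes it actually receives.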

Password Hashing Is Different

General-purpose hash functions like SHA-256 are designed to be fast. That is the opposite of what password hashing requires, where speed helps the attacker. Password hashing functions such as bcrypt, scrypt, and Argon2 are deliberately slow, and scrypt and Argon2 are also memory-hard, making brute-force attacks expensive. Never use SHA-256 directly for password storage.
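Python's hashlib exposes scrypt where the underlying OpenSSL build supports it, which is enough for a short sketch of salted password hashing. The cost parameters below are illustrative assumptions rather than a tuned recommendation, and hash_password and verify_password are hypothetical helper names.

```python
import hashlib
import hmac
import os

# Illustrative cost settings: n is the CPU/memory cost, r the block size,
# p the parallelism. Real deployments should tune these to their hardware.
SCRYPT_PARAMS = dict(n=2**14, r=8, p=1, maxmem=64 * 1024 * 1024, dklen=32)

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Return (salt, derived_key) for storage alongside the user record."""
    salt = os.urandom(16)  # a fresh random salt per password
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return salt, key

def verify_password(password: str, salt: bytes, expected_key: bytes) -> bool:
    """Recompute the key with the stored salt and compare in constant time."""
    key = hashlib.scrypt(password.encode("utf-8"), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(key, expected_key)

salt, stored_key = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored_key))
print(verify_password("wrong guess", salt, stored_key))
```

Each guess costs the attacker the same memory and time as a legitimate login, and the per-password salt means identical passwords do not produce identical stored keys.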

Key Takeaways

This lesson establishes:

The distinction between preimage resistance, second preimage resistance, and collision resistance
Why MD5 and SHA-1 are no longer acceptable for security-critical use
The birthday paradox and its impact on collision resistance
Why password hashing requires different properties than data integrity hashing
Examples of hash functions in production systems: git, Docker, and Subresource Integrity

Next: MACs and Authenticated Encryption