The internal machinery of the modern hash function — chunks, trees, and a 4×4 grid of mixing.
Step 1
Instead of processing data one block at a time like SHA-256, BLAKE3 starts by slicing your entire input into 1 KB chunks (1,024 bytes each; only the last chunk may be shorter). Think of it as taking a long scroll and cutting it into uniform pages. Each page can be processed independently: you don't need to wait for page 1 to finish before starting page 2.
Follow the data
Your message (any length, from a tweet to a movie file) → split into 1 KB chunks (chunk 0, chunk 1, chunk 2, chunk 3, ...) → compress each chunk independently (all chunks can run at the same time on different cores).
Why 1 KB?
It's a sweet spot. Too small (like SHA-256's 64-byte blocks) and you spend too much time on per-chunk overhead. Too large and you lose parallelism for small files. At 1 KB, even a 16 KB file gets split into 16 chunks that can be compressed in parallel, whether across SIMD lanes on one core or across several cores.
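The chunking step can be sketched in a few lines of Python. The names `CHUNK_LEN` and `split_chunks` are illustrative and not taken from any library; the sketch assumes the whole input is already in memory and treats an empty input as a single empty chunk.

```python
CHUNK_LEN = 1024  # bytes per chunk

def split_chunks(data: bytes) -> list[tuple[int, bytes]]:
    """Return (chunk_counter, chunk_bytes) pairs; only the last chunk may be short."""
    if not data:
        # Assumption: an empty message is still hashed as one (empty) chunk.
        return [(0, b"")]
    return [(offset // CHUNK_LEN, data[offset:offset + CHUNK_LEN])
            for offset in range(0, len(data), CHUNK_LEN)]

chunks = split_chunks(bytes(16 * 1024))  # a 16 KB input of zero bytes
print(len(chunks))                       # number of independent work items
```

Each pair carries its own counter, so the chunks can be handed to workers in any order without losing track of where they belong.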
Step 2
Each chunk is processed as up to sixteen 64-byte blocks, and for each block BLAKE3 sets up a 4×4 grid of 32-bit values: 16 "cups" arranged in rows and columns. Each round, the algorithm stirs them in two patterns: first down the columns, then along the diagonals. This ensures every value mixes with every other value quickly. BLAKE3 runs 7 rounds per block where SHA-256 runs 64, but the counts are not directly comparable: each BLAKE3 round applies the mixing function (called G) eight times, so a single round does far more work than one SHA-256 round.
The 4×4 state matrix
Row 1 (v0-v3) and row 2 (v4-v7): the incoming chaining value. Row 3 (v8-v11): constants (the first four IV words). Row 4 (v12-v15): counter (low and high halves), block length, flags.
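As a rough illustration of that layout, the sketch below builds the 16 words as four rows. The function name `initial_state` is made up for this example; the `IV` constants are the eight 32-bit words BLAKE3 shares with SHA-256's initial hash values.

```python
IV = [0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19]

def initial_state(chaining_value, counter, block_len, flags):
    """Return the 4x4 state (v0..v15) as four rows of 32-bit words."""
    assert len(chaining_value) == 8
    return [
        list(chaining_value[0:4]),            # row 1: v0..v3
        list(chaining_value[4:8]),            # row 2: v4..v7
        IV[0:4],                              # row 3: v8..v11
        [counter & 0xFFFFFFFF,                # row 4: counter, low 32 bits
         (counter >> 32) & 0xFFFFFFFF,        #        counter, high 32 bits
         block_len,                           #        bytes used in this block
         flags],                              #        role flags
    ]

# First block of chunk 5 in plain hash mode (flag value 1 = CHUNK_START).
for row in initial_state(IV, counter=5, block_len=64, flags=1):
    print(" ".join(f"{word:08x}" for word in row))
```

Only the bottom row changes between two blocks that start from the same chaining value, which is where position and role enter the computation.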
7 rounds of mixing: each one mixes columns, then diagonals
Apply the G function to each of the 4 columns: (v0, v4, v8, v12), (v1, v5, v9, v13), (v2, v6, v10, v14), (v3, v7, v11, v15). Each column gets two message words mixed in.
Apply G to the 4 diagonals: (v0, v5, v10, v15), (v1, v6, v11, v12), (v2, v7, v8, v13), (v3, v4, v9, v14). This cross-cuts the columns so every value influences every other.
Before the next round, the 16 message words are shuffled into a new order. This means each round mixes in different words at different positions — maximizing diffusion.
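After 7 rounds, the top two rows of the matrix (v0-v7) are XORed with the bottom two rows (v8-v15). The result is the 256-bit chaining value for this block, which feeds the next block of the chunk; the output of the chunk's last block is the chaining value for the whole chunk.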
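The round structure described above can be sketched compactly in Python. This is an unoptimized illustration, not a production implementation: the helper names (`rotr`, `g`, `one_round`, `compress`) are chosen for this example, and the rotation amounts (16, 12, 8, 7) and the message permutation are the values from the BLAKE3 reference design as best recalled here, so treat them as assumptions to check against the specification.

```python
MASK32 = 0xFFFFFFFF
IV = [0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19]
MSG_PERMUTATION = [2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8]
CHUNK_START, CHUNK_END, ROOT = 1, 2, 8

def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32

def g(v, a, b, c, d, mx, my):
    """Mix four state words with two message words (add, XOR, rotate)."""
    v[a] = (v[a] + v[b] + mx) & MASK32
    v[d] = rotr(v[d] ^ v[a], 16)
    v[c] = (v[c] + v[d]) & MASK32
    v[b] = rotr(v[b] ^ v[c], 12)
    v[a] = (v[a] + v[b] + my) & MASK32
    v[d] = rotr(v[d] ^ v[a], 8)
    v[c] = (v[c] + v[d]) & MASK32
    v[b] = rotr(v[b] ^ v[c], 7)

def one_round(v, m):
    # Columns.
    g(v, 0, 4, 8, 12, m[0], m[1])
    g(v, 1, 5, 9, 13, m[2], m[3])
    g(v, 2, 6, 10, 14, m[4], m[5])
    g(v, 3, 7, 11, 15, m[6], m[7])
    # Diagonals.
    g(v, 0, 5, 10, 15, m[8], m[9])
    g(v, 1, 6, 11, 12, m[10], m[11])
    g(v, 2, 7, 8, 13, m[12], m[13])
    g(v, 3, 4, 9, 14, m[14], m[15])

def compress(cv, block_words, counter, block_len, flags):
    """Compress one 64-byte block (16 little-endian words); return 16 words."""
    v = list(cv) + IV[:4] + [counter & MASK32, (counter >> 32) & MASK32,
                             block_len, flags]
    m = list(block_words)
    for _ in range(7):
        one_round(v, m)
        m = [m[i] for i in MSG_PERMUTATION]  # reorder words for the next round
    for i in range(8):
        v[i] ^= v[i + 8]       # first 8 words: the new chaining value
        v[i + 8] ^= cv[i]      # last 8 words: only used for extended output
    return v

# An empty input is a single empty block that is chunk start, chunk end and root.
out = compress(IV, [0] * 16, counter=0, block_len=0,
               flags=CHUNK_START | CHUNK_END | ROOT)
digest = b"".join(word.to_bytes(4, "little") for word in out[:8])
print(digest.hex())
```

The usage lines hash the empty message, so the printed value can be compared with the empty-input digest reported by a reference tool such as `b3sum`.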
Why only 7 rounds?
The G function combines three operations: 32-bit addition, XOR, and bitwise rotation. They interact in ways that spread changes quickly, and because each round applies G eight times, one BLAKE3 round does much more work than one SHA-256 round. BLAKE3's predecessor BLAKE2s uses 10 rounds of the same structure; the designers cut this to 7 on the basis of published cryptanalysis, arguing that 7 rounds still leaves a comfortable security margin.
Step 3
After each chunk is compressed into a chaining value, BLAKE3 combines them in pairs — like a sports tournament bracket. Chunk 0's output pairs with chunk 1's. Chunk 2 pairs with chunk 3. The winners pair up again, and again, until there's one root hash at the top. Any branch of the tree can be computed independently.
Chunk outputs combine upward
8 chunks → 4 parent nodes → 2 parent nodes → 1 root hash. Each level can be computed in parallel.
In SHA-256's chain, block 10 must wait for blocks 1-9 to finish. In BLAKE3's tree, chunks 0 and 1 are independent, chunks 2 and 3 are independent, and so is every other pair. This is why BLAKE3 can scale close to linearly with the number of cores on large inputs: in the ideal case 4 cores ≈ 4× faster and 16 cores ≈ 16× faster, although memory bandwidth and scheduling overhead limit the gains in practice.
Fun fact
The parent node compression uses the same compression function as the leaf chunks — but with a PARENT flag so BLAKE3 knows it's combining two chaining values, not raw data. One function does everything.
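The bracket-style merge can be sketched as a short recursion. In this illustration `parent` is a stand-in for "compress two chaining values with the PARENT flag", and the demo passes a function that just builds nested tuples so the shape of the tree is visible. The split rule used here (the left subtree takes the largest power-of-two number of chunks that still leaves at least one chunk on the right) is an assumption about how uneven chunk counts are handled; ROOT handling is left out.

```python
def largest_power_of_two_below(n: int) -> int:
    """Largest power of two strictly less than n (requires n >= 2)."""
    return 1 << ((n - 1).bit_length() - 1)

def merge_tree(chaining_values, parent):
    """Reduce a list of chunk outputs to one value by pairwise merging."""
    if len(chaining_values) == 1:
        return chaining_values[0]
    split = largest_power_of_two_below(len(chaining_values))
    left = merge_tree(chaining_values[:split], parent)    # independent of right
    right = merge_tree(chaining_values[split:], parent)   # independent of left
    return parent(left, right)

# Stand-in parent function: record the pairing instead of hashing.
show_pairing = lambda left, right: (left, right)

print(merge_tree(list(range(8)), show_pairing))
print(merge_tree(list(range(5)), show_pairing))
```

Because the left and right recursive calls share no data, they are natural units of work to hand to separate threads.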
Step 4
Since chunks are processed out of order and in parallel, BLAKE3 needs a way to tell them apart. Each chunk gets a counter (its position in the input) and flags (its role). These are baked into the bottom row of the 4×4 state matrix before compression, so even identical data at different positions produces different outputs.
Counter: a 64-bit number that counts which chunk this is: chunk 0, chunk 1, chunk 2, and so on. It goes into v12 and v13 in the state matrix. This means the same data at position 5 hashes differently than the same data at position 9.
CHUNK_START and CHUNK_END: the first block of a chunk gets the CHUNK_START flag. The last block gets CHUNK_END. If a chunk is only one block, it gets both. This tells the compression function where chunk boundaries are.
PARENT: when combining two chaining values in the Merkle tree, the PARENT flag is set. This distinguishes tree-internal operations from leaf-level chunk processing.
ROOT: the very last compression, whether it's a single chunk or the top of a tree, gets the ROOT flag. This is what prevents length extension attacks: the final output comes from a compression flagged differently from every intermediate one, so a published hash cannot be reused as an internal chaining value to extend the message.
Mode flags: KEYED_HASH, DERIVE_KEY_CONTEXT, and DERIVE_KEY_MATERIAL activate BLAKE3's other modes. Same compression function, different flag, different purpose.
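The flag rules for the blocks of a chunk can be written down as bit operations. The bit values below follow the BLAKE3 specification; the helper `block_flags` and its parameters are hypothetical names for this sketch, which covers chunk blocks only (parent nodes would use PARENT instead of the chunk flags).

```python
CHUNK_START         = 1 << 0
CHUNK_END           = 1 << 1
PARENT              = 1 << 2
ROOT                = 1 << 3
KEYED_HASH          = 1 << 4
DERIVE_KEY_CONTEXT  = 1 << 5
DERIVE_KEY_MATERIAL = 1 << 6

def block_flags(block_index, num_blocks, mode_flags=0, is_only_chunk=False):
    """Flags for block `block_index` of a chunk that has `num_blocks` blocks."""
    flags = mode_flags                  # the mode flag is set on every block
    if block_index == 0:
        flags |= CHUNK_START
    if block_index == num_blocks - 1:
        flags |= CHUNK_END
        if is_only_chunk:
            flags |= ROOT               # a one-chunk input is its own root
    return flags

# A full 16-block chunk in the middle of a larger input.
print([block_flags(i, 16) for i in range(16)])
# A short message that fits in one block, hashed with a key.
print(block_flags(0, 1, mode_flags=KEYED_HASH, is_only_chunk=True))
```

Since the flags are independent bits, one block can carry several roles at once, as in the single-block case.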
Fun fact
BLAKE3 doesn't need the length-encoding padding that SHA-256 requires. A short final block is simply filled with zeros, and the block length field together with the CHUNK_END flag removes any ambiguity about where the message ends. The counter does a different job: it ties each chunk to its position in the input.
Step 5
Most hash functions do one thing: hash. SHA-256 gives you 256 bits, always. BLAKE3 can be a hash, a keyed hash (MAC), a key derivation function (KDF), and an extendable output function (XOF) — all from the same core. The only difference is which flags are set and what goes into the initial state.
Hash: standard mode. Input goes in, a fixed-length (or extendable) hash comes out. The starting chaining value is the standard IV constants.
Keyed hash: like a hash, but the starting chaining value is a 32-byte secret key instead of the IV. This lets you prove that a message is authentic, because only someone with the key can produce or check the tag. With SHA-256, you need a separate construction (HMAC) for this.
Key derivation: takes a master key or other high-entropy key material and derives sub-keys from it. It is not meant for passwords: BLAKE3 is deliberately fast, and password hashing calls for a deliberately slow function such as Argon2. BLAKE3 hashes the context string first (like "my-app-2024-encryption-key"), then uses the result as the key to hash the input material. The two steps use separate flags, which provides domain separation.
Extendable output: most hash functions output a fixed number of bytes. BLAKE3 can output as many bytes as you want. It does this by repeatedly running the root compression with an incrementing counter, producing 64 new bytes each time. Need 32 bytes? Take the first 32. Need 1,000? Keep going.
Extendable output: same input, any length
The root node can be "squeezed" for as many bytes as needed, with no practical limit.
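The squeezing loop itself is simple enough to sketch. Here `root_block` is a hypothetical callable that returns the 64 output bytes for a given counter value; the demo substitutes SHA-512 purely as a stand-in that happens to return 64 bytes, so the bytes printed are not BLAKE3 output.

```python
import hashlib

def squeeze(root_block, num_bytes: int) -> bytes:
    """Collect num_bytes of output, 64 bytes per counter value."""
    out = bytearray()
    counter = 0
    while len(out) < num_bytes:
        out += root_block(counter)   # one more 64-byte block
        counter += 1
    return bytes(out[:num_bytes])    # trim the final block if needed

def stand_in_root_block(counter: int) -> bytes:
    # NOT BLAKE3: a placeholder with the right output size (64 bytes).
    return hashlib.sha512(counter.to_bytes(8, "little")).digest()

short = squeeze(stand_in_root_block, 32)
long = squeeze(stand_in_root_block, 1000)
print(len(short), len(long), long.startswith(short))
```

A shorter output is always a prefix of a longer one from the same input, so the output length should not be treated as a way to separate domains.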
Why does this matter?
With SHA-256, you need different tools for different jobs: SHA-256 for hashing, HMAC-SHA-256 for authentication, HKDF-SHA-256 for key derivation. With BLAKE3, it's all one algorithm built on one compression function, so there is a single core to analyze. Fewer moving parts means fewer things that can go wrong.
The takeaway
BLAKE3 was built for the world we live in now — multi-core processors, massive files, and the need for one tool that does many jobs. Split into chunks, compress them all at once, combine with a tree, stamp each piece with its position and role, and squeeze out as many bytes as you need. It's not just a faster SHA-256 — it's a fundamentally different architecture for the same problem.