ELI5 — Deep Dive

How BLAKE3 actually works

The internal machinery of the modern hash function — chunks, trees, and a 4×4 grid of mixing.

Split into chunks

📜

Slicing a scroll into pages

SHA-256 processes data one 64-byte block at a time, and each block has to wait for the one before it. BLAKE3 starts differently: it slices your entire input into 1 KB chunks (1,024 bytes each; only the last chunk may be shorter). Think of it as taking a long scroll and cutting it into uniform pages. Each page can be processed independently, so you don't need to wait for page 1 to finish before starting page 2.

Follow the data

Your message

Any length — could be a tweet or a movie file

Split into 1 KB chunks

Chunk 0 • Chunk 1 • Chunk 2 • Chunk 3 • ...

Compress each independently

All chunks can run at the same time on different cores

Why 1 KB?

It's a sweet spot. Too small (say 64 bytes, the size of a single SHA-256 block) and you spend too much time on the overhead of combining results. Too large and small files don't split into enough pieces to parallelize. At 1 KB, even a 16 KB file gets split into 16 chunks that could, in principle, run on 16 cores or SIMD lanes at once.
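The slicing step is simple enough to sketch in a few lines. This is a minimal illustration, not a real implementation: the function name `split_into_chunks` is made up for the example, and the empty-input rule (an empty message counts as one empty chunk) is an assumption taken from the BLAKE3 spec rather than from the text above.

```python
CHUNK_LEN = 1024  # bytes per chunk, as described above


def split_into_chunks(data: bytes) -> list[bytes]:
    """Slice the input into 1 KB chunks; only the last one may be shorter."""
    if not data:
        # Assumption: an empty message is treated as a single empty chunk.
        return [b""]
    return [data[i:i + CHUNK_LEN] for i in range(0, len(data), CHUNK_LEN)]


chunks = split_into_chunks(bytes(16 * 1024))
print(len(chunks))                                      # 16
print([len(c) for c in split_into_chunks(bytes(2500))])  # [1024, 1024, 452]
```

Because each slice depends only on its offset, any chunk can be handed to any worker without looking at its neighbors.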

The compression function

🎛️

A 4×4 grid, stirred two ways

Each chunk is fed through the compression function one 64-byte block at a time (16 blocks in a full chunk). For each block, BLAKE3 sets up a 4×4 grid of 32-bit values: 16 "cups" arranged in rows and columns. Each round, the algorithm stirs them in two patterns: first down the columns, then along the diagonals. This ensures every value mixes with every other value quickly. The mixing function (called G) runs eight times per round, so BLAKE3 gets by with 7 rounds per block where SHA-256 uses 64 simpler ones.

The 4×4 state matrix

v0   v1   v2   v3
v4   v5   v6   v7
v8   v9   v10  v11
v12  v13  v14  v15

Rows 1 and 2 (v0-v7): the incoming chaining value. Row 3 (v8-v11): constants. Row 4: the counter (v12, v13), the block length (v14), and the flags (v15).
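Here is a minimal sketch of how that grid could be filled in before the mixing starts. The helper name `initial_state` is hypothetical, and the `IV` values are an assumption brought in from the published BLAKE3 spec (the text above only says "constants"); row 3 uses the first four of them.

```python
MASK32 = 0xFFFFFFFF

# Assumed constants from the BLAKE3 spec (the same eight words SHA-256 starts from).
IV = (0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19)


def initial_state(chaining_value, counter, block_len, flags):
    """Lay out the 16 words v0..v15 of the 4x4 state matrix."""
    return [
        *chaining_value,           # rows 1-2: v0..v7, the incoming chaining value
        *IV[:4],                   # row 3:    v8..v11, constants
        counter & MASK32,          # v12: low 32 bits of the 64-bit counter
        (counter >> 32) & MASK32,  # v13: high 32 bits of the counter
        block_len,                 # v14: how many bytes of this block are real
        flags,                     # v15: role flags
    ]


state = initial_state(IV, counter=5, block_len=64, flags=1)
print(state[12:])  # [5, 0, 64, 1]
```

For the very first block of a plain hash, the incoming chaining value is the IV itself, which is why `IV` is passed in twice in the usage line.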

7 rounds of mixing

7 rounds — each one mixes columns, then diagonals

1

Column mixing

Apply the G function to each of the 4 columns: (v0, v4, v8, v12), (v1, v5, v9, v13), (v2, v6, v10, v14), (v3, v7, v11, v15). Each column gets two message words mixed in.

2

Diagonal mixing

Apply G to the 4 diagonals: (v0, v5, v10, v15), (v1, v6, v11, v12), (v2, v7, v8, v13), (v3, v4, v9, v14). This cross-cuts the columns so every value influences every other.

3

Permute the message

Before the next round, the 16 message words are shuffled into a new order. This means each round mixes in different words at different positions — maximizing diffusion.

4

Output the chaining value

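After 7 rounds, the top two rows of the matrix (v0-v7) are XORed with the bottom two rows (v8-v15). The result is a 256-bit chaining value, which becomes the starting point for the next block in the chunk. The chaining value that comes out of the chunk's last block is the chunk's output.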
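The four steps above fit in well under a page of code. What follows is a sketch under stated assumptions, not a reference implementation: the rotation amounts inside `g`, the `IV` words, the `MSG_PERMUTATION` table, and the numeric flag values are not given in the text above and are taken from the published BLAKE3 spec.

```python
import struct

MASK32 = 0xFFFFFFFF
# Assumed constants from the BLAKE3 spec (not stated in the text above).
IV = (0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19)
MSG_PERMUTATION = (2, 6, 3, 10, 7, 0, 4, 13, 1, 11, 12, 5, 9, 14, 15, 8)
CHUNK_START, CHUNK_END, ROOT = 1, 2, 8


def rotr(x, n):
    """Rotate a 32-bit word right by n bits."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def g(v, a, b, c, d, mx, my):
    """Mix four cups of the grid with two message words: add, XOR, rotate."""
    v[a] = (v[a] + v[b] + mx) & MASK32
    v[d] = rotr(v[d] ^ v[a], 16)
    v[c] = (v[c] + v[d]) & MASK32
    v[b] = rotr(v[b] ^ v[c], 12)
    v[a] = (v[a] + v[b] + my) & MASK32
    v[d] = rotr(v[d] ^ v[a], 8)
    v[c] = (v[c] + v[d]) & MASK32
    v[b] = rotr(v[b] ^ v[c], 7)


def one_round(v, m):
    # Step 1: the four columns, two message words each.
    g(v, 0, 4, 8, 12, m[0], m[1])
    g(v, 1, 5, 9, 13, m[2], m[3])
    g(v, 2, 6, 10, 14, m[4], m[5])
    g(v, 3, 7, 11, 15, m[6], m[7])
    # Step 2: the four diagonals.
    g(v, 0, 5, 10, 15, m[8], m[9])
    g(v, 1, 6, 11, 12, m[10], m[11])
    g(v, 2, 7, 8, 13, m[12], m[13])
    g(v, 3, 4, 9, 14, m[14], m[15])


def compress(cv, block_words, counter, block_len, flags):
    """Compress one 64-byte block (16 words) into an 8-word chaining value."""
    v = [*cv, *IV[:4], counter & MASK32, (counter >> 32) & MASK32,
         block_len, flags]
    m = list(block_words)
    for _ in range(7):
        one_round(v, m)
        m = [m[i] for i in MSG_PERMUTATION]      # Step 3: shuffle message words
    return [v[i] ^ v[i + 8] for i in range(8)]   # Step 4: top rows XOR bottom rows


# An empty message: one empty block that is chunk start, chunk end, and root.
words = compress(IV, [0] * 16, 0, 0, CHUNK_START | CHUNK_END | ROOT)
digest = struct.pack("<8I", *words)
print(digest.hex())
```

An empty message is the smallest possible test: a single call covers the whole hash, so the printed value should match BLAKE3's published digest of the empty string, which begins `af1349b9`.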

Why only 7 rounds?

A BLAKE3 round does more work than a SHA-256 round. Each round applies the G function eight times, and G uses addition, XOR, and bitwise rotation: three different operations that interact in ways that spread changes quickly. The designers chose 7 rounds based on years of published cryptanalysis of BLAKE3's predecessors, BLAKE and BLAKE2, and argue that it leaves an adequate security margin, though a deliberately slimmer one than BLAKE2's.

The Merkle tree

🏆

A tournament bracket

After each chunk is compressed into a chaining value, BLAKE3 combines them in pairs — like a sports tournament bracket. Chunk 0's output pairs with chunk 1's. Chunk 2 pairs with chunk 3. The winners pair up again, and again, until there's one root hash at the top. Any branch of the tree can be computed independently.

Chunk outputs combine upward

                Root
        P01             P23
    P0      P1      P2      P3
  C0  C1  C2  C3  C4  C5  C6  C7

8 chunks → 4 parent nodes → 2 parent nodes → 1 root hash. Each level can be computed in parallel.
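Eight chunks make a tidy bracket, but real inputs rarely land on a power of two. The sketch below shows one way the bracket can be laid out for any chunk count. It assumes a rule from the BLAKE3 spec that the text above doesn't spell out: at every split, the left side takes the largest power-of-two number of chunks that still leaves at least one chunk for the right. The function name `tree_shape` is made up for the example, and it only draws the shape; it does no hashing.

```python
def tree_shape(first: int, count: int):
    """Return the bracket for `count` chunks starting at chunk `first`.

    Leaves are strings like "C0"; parents are (left, right) tuples.
    """
    if count == 1:
        return f"C{first}"
    # Largest power of two strictly less than `count` (assumed BLAKE3 rule).
    left = 1 << ((count - 1).bit_length() - 1)
    return (tree_shape(first, left), tree_shape(first + left, count - left))


print(tree_shape(0, 8))
# ((('C0', 'C1'), ('C2', 'C3')), (('C4', 'C5'), ('C6', 'C7')))
print(tree_shape(0, 5))
# ((('C0', 'C1'), ('C2', 'C3')), 'C4')
```

With 5 chunks, the first four form a complete bracket and the leftover chunk waits for them at the final, so the tree stays as balanced as it can on the left.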

Why trees beat chains

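In SHA-256's chain, block 10 must wait for blocks 1-9 to finish. In BLAKE3's tree, chunks 0 and 1 are independent, chunks 2 and 3 are independent, and so is every other pair. This is why BLAKE3 can scale almost linearly with the number of cores on large inputs: 4 cores can approach 4× the speed and 16 cores can approach 16×, until something else (usually memory bandwidth) becomes the bottleneck. The same independence lets a single core hash several chunks side by side using SIMD instructions.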

Fun fact

The parent node compression uses the same compression function as the leaf chunks — but with a PARENT flag so BLAKE3 knows it's combining two chaining values, not raw data. One function does everything.
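To make "one function does everything" concrete, here is a hedged sketch of the inputs a parent node would hand to the compression function. The helper name `parent_inputs` is hypothetical, and the details (counter fixed at 0, a full 64-byte block, the flag's bit position, the `IV` words) are assumptions taken from the BLAKE3 spec rather than from the text above.

```python
# Assumed constants from the BLAKE3 spec.
IV = (0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19)
PARENT = 1 << 2


def parent_inputs(left_cv, right_cv, key_words=IV, mode_flags=0):
    """Describe the single compression call that merges two child outputs."""
    return {
        "chaining_value": tuple(key_words),     # IV, or the key in keyed modes
        "block_words": [*left_cv, *right_cv],   # 8 + 8 words = one 64-byte block
        "counter": 0,                           # parents are not numbered
        "block_len": 64,
        "flags": PARENT | mode_flags,
    }


node = parent_inputs([1] * 8, [2] * 8)
print(len(node["block_words"]), node["flags"])  # 16 4
```

The two child chaining values simply take the place that 64 bytes of message would occupy in a leaf, so no second function is needed.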

Counter & flags

📬

Every page has a stamp

Since chunks are processed out of order and in parallel, BLAKE3 needs a way to tell them apart. Each chunk gets a counter (its position in the input) and flags (its role). These are baked into the bottom row of the 4×4 state matrix before compression, so even identical data at different positions produces different outputs.

1

Counter

A 64-bit number that counts which chunk this is: chunk 0, chunk 1, chunk 2, and so on. It goes into v12 (low 32 bits) and v13 (high 32 bits) in the state matrix. This means the same data at position 5 hashes differently than the same data at position 9.

2

CHUNK_START / CHUNK_END

The first block of a chunk gets a CHUNK_START flag. The last block gets CHUNK_END. If a chunk is only one block, it gets both. This tells the compression function where boundaries are.

3

PARENT

When combining two chaining values in the Merkle tree, the PARENT flag is set. This distinguishes tree-internal operations from leaf-level chunk processing.

4

ROOT

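The very last compression, whether it's a single chunk or the top of a tree, gets the ROOT flag. This is what prevents length extension attacks: the final output only ever comes from a ROOT compression, and every intermediate chaining value comes from a non-ROOT one. Knowing the hash therefore doesn't hand an attacker an internal state they can keep hashing from.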

5

Mode flags

KEYED_HASH, DERIVE_KEY_CONTEXT, and DERIVE_KEY_MATERIAL flags activate BLAKE3's other modes. Same compression function, different flag — different purpose.
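The stamps in the list above can be pictured as single bits in one small number. The sketch below shows which flags each block of a chunk might carry. The bit positions are assumptions taken from the BLAKE3 spec (the text above names the flags but not their values), and `block_flags` is a made-up helper.

```python
# Assumed bit positions from the BLAKE3 spec.
CHUNK_START = 1 << 0
CHUNK_END = 1 << 1
PARENT = 1 << 2
ROOT = 1 << 3
KEYED_HASH = 1 << 4
DERIVE_KEY_CONTEXT = 1 << 5
DERIVE_KEY_MATERIAL = 1 << 6


def block_flags(block_index, num_blocks, is_only_chunk=False, mode_flags=0):
    """Flags for one block of a chunk that has `num_blocks` blocks."""
    flags = mode_flags                  # a mode flag rides along on every block
    if block_index == 0:
        flags |= CHUNK_START
    if block_index == num_blocks - 1:
        flags |= CHUNK_END
        if is_only_chunk:
            flags |= ROOT               # a lone chunk's last block is the root
    return flags


print([block_flags(i, 16) for i in (0, 1, 15)])  # [1, 0, 2]
print(block_flags(0, 1, is_only_chunk=True))     # 11
```

A full 16-block chunk stamps only its first and last blocks; a tiny one-block message gets CHUNK_START, CHUNK_END, and ROOT all at once.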

Fun fact

The block length field is why BLAKE3 doesn't need the length padding that SHA-256 requires. A short final block is simply filled out with zeros, and the block length field records how many of its bytes are real. Together with the CHUNK_END flag, that leaves no ambiguity about where the message ends.
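A small sketch of how the block length field can stand in for padding. It assumes, per the BLAKE3 spec, that a short final block is zero-filled to 64 bytes and read as sixteen little-endian 32-bit words; `final_block` is a hypothetical helper name.

```python
import struct

BLOCK_LEN = 64  # bytes per block


def final_block(data: bytes):
    """Zero-fill a short block and report how many bytes were real."""
    if len(data) > BLOCK_LEN:
        raise ValueError("a block holds at most 64 bytes")
    padded = data.ljust(BLOCK_LEN, b"\x00")
    words = list(struct.unpack("<16I", padded))  # 16 little-endian words
    return words, len(data)                      # the length goes into v14


words_a, len_a = final_block(b"abc")
words_b, len_b = final_block(b"abc\x00")
print(words_a == words_b, len_a, len_b)  # True 3 4
```

The two messages fill the block with identical words, yet they can't collide, because the recorded length (3 versus 4) is mixed into the state matrix along with them.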

One algorithm, many jobs

🛠️

The Swiss Army knife of hashing

Most hash functions do one thing: hash. SHA-256 gives you 256 bits, always. BLAKE3 can be a hash, a keyed hash (MAC), a key derivation function (KDF), and an extendable output function (XOF) — all from the same core. The only difference is which flags are set and what goes into the initial state.

1

Hash (default)

Standard mode. Input goes in, a fixed-length (or extendable) hash comes out. The initial state uses the standard IV constants.

2

Keyed hash (MAC)

Like a hash, but a 32-byte secret key takes the place of the IV as the starting chaining value, and the KEYED_HASH flag is set. This lets you prove that a message is authentic: only someone with the key can produce or check the resulting tag. With SHA-256, you need a separate construction (HMAC) for this.

3

Key derivation (KDF)

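Takes a master key and derives sub-keys from it. The input should already be high-entropy key material; this mode is not a password hash, and passwords need a dedicated function such as Argon2. BLAKE3 hashes the context string first (like "my-app-2024-encryption-key"), then uses the result as the key to hash the input key material. Each of the two steps has its own flag (DERIVE_KEY_CONTEXT, then DERIVE_KEY_MATERIAL), which ensures domain separation.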

4

Extendable output (XOF)

Most hash functions output a fixed number of bytes. BLAKE3 can output as many bytes as you want. It does this by repeatedly running the root compression with an incrementing counter, producing 64 new bytes each time. Need 32 bytes? Take the first 32. Need 1,000? Keep going.
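For the first three modes in the list above, the whole difference comes down to two inputs: which eight words start off the chaining value, and which mode flag is set. A minimal sketch, assuming the `IV` words and flag bit positions from the BLAKE3 spec; `mode_setup` and its mode names are hypothetical, invented for this example.

```python
import struct

# Assumed constants from the BLAKE3 spec.
IV = (0x6A09E667, 0xBB67AE85, 0x3C6EF372, 0xA54FF53A,
      0x510E527F, 0x9B05688C, 0x1F83D9AB, 0x5BE0CD19)
KEYED_HASH = 1 << 4
DERIVE_KEY_CONTEXT = 1 << 5
DERIVE_KEY_MATERIAL = 1 << 6


def key_to_words(key: bytes):
    """Read a 32-byte key as eight little-endian 32-bit words."""
    if len(key) != 32:
        raise ValueError("key must be exactly 32 bytes")
    return struct.unpack("<8I", key)


def mode_setup(mode, key=None):
    """Return (starting chaining value, mode flag) for one hashing pass."""
    if mode == "hash":
        return IV, 0
    if mode == "keyed_hash":
        return key_to_words(key), KEYED_HASH
    if mode == "derive_key_context":      # KDF step 1: hash the context string
        return IV, DERIVE_KEY_CONTEXT
    if mode == "derive_key_material":     # KDF step 2: key = output of step 1
        return key_to_words(key), DERIVE_KEY_MATERIAL
    raise ValueError(f"unknown mode: {mode}")


words, flag = mode_setup("keyed_hash", bytes(range(32)))
print(hex(words[0]), flag)  # 0x3020100 16
```

Everything downstream (chunks, compression, tree) runs the same way in every mode, which is the sense in which one core serves several jobs.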

Extendable output: same input, any length

BLAKE3( data )
32 bytes (default)
64 bytes
128 bytes
any length you want...

The root node can be "squeezed" for as many bytes as needed, with no practical limit.
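The squeezing loop itself is tiny. In the sketch below, `squeeze` is a hypothetical helper that calls a root-block function with counter 0, 1, 2, and so on, taking 64 bytes each time. To keep the example self-contained, the root-block function is a stand-in built on Python's `hashlib.blake2b`; it is not BLAKE3's root compression and will not produce BLAKE3 output. Only the loop is the point.

```python
import hashlib


def squeeze(root_block, out_len: int) -> bytes:
    """Collect `out_len` bytes by calling root_block(0), root_block(1), ..."""
    out = bytearray()
    counter = 0
    while len(out) < out_len:
        out += root_block(counter)   # 64 fresh bytes per call
        counter += 1
    return bytes(out[:out_len])      # trim the last block to fit


def standin_root_block(counter: int) -> bytes:
    """Placeholder only: NOT BLAKE3. Any function of the counter returning 64 bytes."""
    data = b"demo-root" + counter.to_bytes(8, "little")
    return hashlib.blake2b(data, digest_size=64).digest()


short = squeeze(standin_root_block, 32)
long = squeeze(standin_root_block, 1000)
print(len(long), long[:32] == short)  # 1000 True
```

Because each 64-byte block depends only on the counter, a shorter output is always a prefix of a longer one, and any block can be produced without computing the ones before it.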

Why does this matter?

With SHA-256, you need a different construction for each job: plain SHA-256 for hashing, HMAC-SHA-256 for authentication, HKDF-SHA-256 for key derivation. With BLAKE3, every mode is the same compression function and the same tree with different flags, so there is one core design to analyze and implement. Fewer moving parts means fewer things that can go wrong.

The whole picture

🌳

Parallel by design

BLAKE3 was built for the world we live in now — multi-core processors, massive files, and the need for one tool that does many jobs. Split into chunks, compress them all at once, combine with a tree, stamp each piece with its position and role, and squeeze out as many bytes as you need. It's not just a faster SHA-256 — it's a fundamentally different architecture for the same problem.