A deep dive into how the LZ4 decompression algorithm can be implemented on the Z80, 8080, 8086, and 6502, exploring the architectural quirks that shape each version, the shared API, and the broader lessons for low‑level programming on 8‑ and 16‑bit systems.
Comparing an LZ4 Decompressor on Four Legacy CPUs

Introduction – why LZ4 on a SNES matters
When I first needed to squeeze a few kilobytes of sprite data into a SNES cartridge, I turned to the LZ4 algorithm, which was already popular in modern desktop tools. The SNES’s limited RAM and the fact that the decompressor runs on a 65816 gave me a chance to experiment with shortcuts that would not be feasible on a full‑blown PC. Those shortcuts turned out to be surprisingly portable: the same ideas work on the Z80, the Intel 8080/8086 family, and the MOS 6502. This article documents the four hand‑crafted assembly implementations, explains why the algorithm meshes so well with each processor, and extracts the broader design principles that any hobbyist working on 8‑ or 16‑bit platforms can apply.
A quick refresher on LZ4’s block format
LZ4 belongs to the LZ77 family, which encodes a stream as alternating literal runs and back‑references. A sequence consists of:
- A single length byte. The high nibble (lenLit) gives the number of literals; the low nibble (lenRef) gives the length of the back‑reference minus four (the algorithm never emits a reference shorter than four bytes because the overhead would outweigh the savings).
- lenLit literal bytes that are copied verbatim to the output.
- A two‑byte little‑endian offset that tells the decoder how far back in the output buffer to start copying.
- Optional extra‑length bytes if either nibble is 0xF. Each extra byte is added to the length, and reading stops after the first byte smaller than 0xFF (that byte is still added). Extra bytes for the literal count follow the length byte directly; extra bytes for the back‑reference length follow the offset.
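To make the nibble and extra‑length rules concrete, here is a minimal Python model. It is only a sketch of the format rules described above, not a transcription of any of the assembly routines, and the names split_token and read_length are my own.

```python
def split_token(token: int) -> tuple[int, int]:
    """Split the length byte into (lenLit, lenRef) nibbles."""
    return token >> 4, token & 0x0F


def read_length(nibble: int, data: bytes, pos: int) -> tuple[int, int]:
    """Extend a 4-bit length field; return (length, new_pos).

    Extra bytes are only present when the nibble is 0xF. Each one is
    added to the length, and the first byte below 0xFF ends the run.
    """
    length = nibble
    if nibble == 0x0F:
        while True:
            extra = data[pos]
            pos += 1
            length += extra
            if extra != 0xFF:
                break
    return length, pos
```

For example, a nibble of 0xF followed by the bytes 0xFF 0x0A yields 15 + 255 + 10 = 280. For a back‑reference, the decoder still has to add the implicit 4 afterwards.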
Two decoder‑side constraints simplify matters for retro hardware:
- The final sequence is always a literal run, which means the compressed stream can be terminated by a zero offset (0x00 0x00). The decoder treats that as “stop”.
- Offsets are never zero for any other sequence, so a zero offset can safely be used as a sentinel.
These rules let us avoid a full‑blown frame header and keep the state machine tiny.
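The whole state machine fits in a few lines of a high‑level language, which is a handy reference when debugging the assembly versions. The following Python sketch assumes the stream layout described above, including the zero‑offset sentinel. The function name lz4_decode and the fallback that stops when the input simply runs out are my own additions for illustration.

```python
def lz4_decode(src: bytes) -> bytes:
    """Reference model of the decoder loop with a zero-offset sentinel."""
    out = bytearray()
    pos = 0

    def extend(length: int) -> int:
        # Read extra length bytes when the 4-bit field is saturated.
        nonlocal pos
        if length == 0x0F:
            while True:
                extra = src[pos]
                pos += 1
                length += extra
                if extra != 0xFF:
                    break
        return length

    while pos < len(src):
        token = src[pos]
        pos += 1
        len_lit = extend(token >> 4)
        out += src[pos:pos + len_lit]          # literal run
        pos += len_lit
        if pos >= len(src):
            break                              # assumption: input may end without sentinel
        offset = src[pos] | (src[pos + 1] << 8)
        pos += 2
        if offset == 0:
            break                              # zero offset means "stop"
        len_ref = extend(token & 0x0F) + 4     # stored length is minus four
        if offset > len(out):
            raise ValueError("offset points before start of output")
        start = len(out) - offset
        for i in range(len_ref):               # byte-wise, so overlap repeats data
            out.append(out[start + i])
    return bytes(out)
```

Feeding it the bytes 0x31 "abc" 0x03 0x00 0x10 "d" 0x00 0x00 produces "abcabcabd": three literals, a five‑byte back‑reference at distance three, one more literal, then the sentinel.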
How the algorithm maps to each CPU
The Zilog Z80 – the natural home for LZ4
The Z80’s LDIR instruction copies a block of bytes from [HL] to [DE] while incrementing both pointers automatically and counting down BC. LZ4 needs exactly that: a source pointer (the compressed stream) and a destination pointer (the output buffer). The accumulator A, with its flags in F, is left free for temporary values and tests. The overall flow is therefore:
- Load the length byte into A.
- Split the nibbles with a rotate‑right‑four (RRCA × 4) and mask.
- If lenLit > 0, call a helper that runs LDIR for the literal count.
- Read the 16‑bit offset, test for the zero‑sentinel, and compute the back‑reference address as DE − BC.
- Use LDIR again for the back‑reference copy, adjusting the length with the extra‑byte reader.
Because LDIR already handles the pointer arithmetic, the Z80 version is the most compact and the fastest of the four.
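Two properties of LDIR matter here, and a small Python model makes both visible. First, it copies one byte at a time in ascending order, so an overlapping back‑reference repeats the pattern, which is exactly what LZ4 expects. Second, it decrements BC before testing it, so a count of zero would copy 64 KiB, which is why the literal copy is guarded by a lenLit > 0 test. The function below is a behavioural sketch (the name ldir and the flat 64 KiB bytearray are illustrative assumptions), not cycle‑accurate emulation.

```python
def ldir(mem: bytearray, hl: int, de: int, bc: int) -> tuple[int, int, int]:
    """Model Z80 LDIR on a 64 KiB memory; return the final (HL, DE, BC)."""
    while True:
        mem[de] = mem[hl]          # (DE) <- (HL)
        hl = (hl + 1) & 0xFFFF
        de = (de + 1) & 0xFFFF
        bc = (bc - 1) & 0xFFFF     # BC == 0 on entry wraps to 0xFFFF
        if bc == 0:
            break
    return hl, de, bc


mem = bytearray(0x10000)
mem[0x8000:0x8002] = b"ab"
# Back-reference with offset 2 and length 6: source overlaps destination.
regs = ldir(mem, 0x8000, 0x8002, 6)
```

After the call, the eight bytes starting at 0x8000 read "abababab", and HL and DE point just past the last byte read and written.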
The Intel 8080 – a close cousin with fewer instructions
The 8080 lacks LDIR and the 16‑bit subtraction instruction SBC HL,BC. The implementation therefore replaces those with:
- A hand‑written byte‑wise copy loop (MOV A,M / STAX D / INX H / INX D).
- Two 8‑bit subtractions (SUB C / SBB B) to compute DE − BC.
The rest of the logic mirrors the Z80 code: rotate‑right‑four to extract nibbles, a small helper to read extra length bytes, and a stack‑based temporary store for the source pointer when the back‑reference length reaches 19 bytes or more (lenRef is 0xF, so extra length bytes follow). The extra stack traffic costs a few cycles, but the overall structure stays recognizably the same.
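The two‑step subtraction is easy to get wrong if the borrow from the low byte is dropped, so it is worth modelling once. This Python sketch mimics the SUB/SBB pair with 8‑bit values only. The function name sub16_8080 is hypothetical, and the register comments show the instruction sequence I have in mind rather than the exact code of the implementation.

```python
def sub16_8080(de: int, bc: int) -> int:
    """Compute (DE - BC) mod 65536 using two 8-bit subtractions."""
    e, d = de & 0xFF, (de >> 8) & 0xFF
    c, b = bc & 0xFF, (bc >> 8) & 0xFF

    low = e - c                      # MOV A,E / SUB C
    borrow = 1 if low < 0 else 0     # carry flag set on borrow
    high = d - b - borrow            # MOV A,D / SBB B

    return ((high & 0xFF) << 8) | (low & 0xFF)
```

With the output pointer at 0x1000 and an offset of 1, the low bytes borrow and the result is 0x0FFF. Using SUB instead of SBB for the high byte would give 0x10FF instead.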
The Intel 8086 – 16‑bit power with string instructions
The 8086 brings a richer register file (AX, BX, CX, DX, SI, DI) and the string family (LODSB, STOSB, MOVSB, REP MOVSB). Those instructions combine load, store, and pointer increment in a single opcode, so the inner copy loops shrink dramatically. The implementation uses:
- LODSB to fetch the length byte and advance SI.
- SHR to obtain the high nibble, AND to mask the low nibble.
- REP MOVSB with CX as the repeat count for both literal and back‑reference copies.
- DX as a temporary register to hold the back‑reference address while SI and DI point to the source and destination buffers.
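The nibble split looks different on each chip but computes the same thing. As a sanity check, this Python sketch compares the rotate‑right‑four plus mask used in the Z80 and 8080 versions with a plain logical shift as SHR performs it. The helper names are mine. Note that on an original 8086 a shift by four needs the count in CL, since immediate counts other than 1 only arrived with the 80186.

```python
def rrca(a: int) -> int:
    """Rotate an 8-bit value right by one; bit 0 moves to bit 7."""
    return ((a >> 1) | ((a & 1) << 7)) & 0xFF


def high_nibble_by_rotate(token: int) -> int:
    """Z80/8080 style: four rotates, then mask off the upper bits."""
    a = token & 0xFF
    for _ in range(4):
        a = rrca(a)
    return a & 0x0F


def high_nibble_by_shift(token: int) -> int:
    """8086 style: logical shift right by four (zeros shift in)."""
    return (token & 0xFF) >> 4


def low_nibble(token: int) -> int:
    """Same on every CPU: AND with 0x0F."""
    return token & 0x0F
```

The mask after the rotate is essential: without it, the old low nibble ends up in the upper four bits of the result.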
Because the 8086 can address the full 1 MiB of memory with far pointers (DS:SI, ES:DI), the routine also preserves segment registers on entry and restores them on exit, keeping the calling convention clean.
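For readers who have not dealt with real‑mode segments, the address arithmetic behind DS:SI and ES:DI is a shift and an add. The Python sketch below shows it; the function name linear_address is illustrative.

```python
def linear_address(segment: int, offset: int) -> int:
    """Real-mode 8086 address: segment * 16 + offset, truncated to 20 bits."""
    return (((segment & 0xFFFF) << 4) + (offset & 0xFFFF)) & 0xFFFFF
```

Different segment:offset pairs can name the same byte (0x1000:0x0100 and 0x1010:0x0000 both resolve to 0x10100), and the sum wraps at 1 MiB on an 8086. Since SI and DI are 16‑bit, a single copy also stays within one 64 KiB window of its segment.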
The MOS Technology 6502 – the outlier
The 6502 has only three general‑purpose registers (A, X, Y) and a tiny 256‑byte stack. It cannot hold a 16‑bit pointer in a register, so the implementation stores the source and destination pointers in zero‑page variables (srcLo/srcHi, dstLo/dstHi). The key challenges are:
- Copying blocks – there is no LDIR analogue. A helper called .ldir performs a nested loop: the low byte of the counter lives in X, the high byte in a RAM location. The loop decrements X and, when it wraps, increments the high‑byte counter. Each iteration copies a byte from [src+Y] to [dst+Y] and increments Y.
- Back‑reference arithmetic – the subtraction dst − offset is done with the carry flag (SEC / SBC). The result is written to a temporary RAM location (bksrc).
- Length extension – the extra‑byte reader (.rdlen) works exactly as on the other CPUs but uses X as a temporary holder for the accumulator because the 6502 lacks a conditional return.
Even though the 6502 version is the longest, it stays faithful to the spirit of the algorithm: all work is expressed as simple byte‑wise operations, and the code never needs to push more than a handful of registers.
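The split‑counter copy is the part I found hardest to reason about, so here is a Python model that restricts itself to 8‑bit quantities. It is one possible arrangement of such a loop, written for illustration, and not a transcription of .ldir: the function name, the "passes" variable, and the choice to count the high byte downwards are all assumptions of this sketch.

```python
def copy_6502_style(mem: bytearray, src: int, dst: int, count: int) -> None:
    """Copy count bytes forward using only 8-bit counters and an 8-bit index."""
    src_lo, src_hi = src & 0xFF, (src >> 8) & 0xFF   # zero-page pointer bytes
    dst_lo, dst_hi = dst & 0xFF, (dst >> 8) & 0xFF
    x = count & 0xFF            # low byte of the counter (X register)
    passes = count >> 8         # high byte of the counter (RAM location)
    if x != 0:
        passes += 1             # a partial page needs one extra outer pass
    # Caveat: passes can reach 256 for counts above 0xFF00, which would
    # not fit in one byte on real hardware.
    y = 0                       # index register Y
    while passes:
        while True:
            s = (((src_hi << 8) | src_lo) + y) & 0xFFFF   # LDA (src),Y
            d = (((dst_hi << 8) | dst_lo) + y) & 0xFFFF   # STA (dst),Y
            mem[d] = mem[s]
            y = (y + 1) & 0xFF                            # INY
            if y == 0:          # index wrapped: move both pointers one page on
                src_hi = (src_hi + 1) & 0xFF
                dst_hi = (dst_hi + 1) & 0xFF
            x = (x - 1) & 0xFF                            # DEX
            if x == 0:
                break           # when x started at 0 this takes 256 iterations
        passes -= 1
```

The key detail is the pre‑adjustment of the outer counter: a count of 0x0103 needs two passes (3 bytes, then 256), while 0x0100 needs exactly one.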
Shared API across the four implementations
| CPU | Source pointer | Destination pointer | Scratch registers | Return values |
|---|---|---|---|---|
| Z80 | HL | DE | BC, AF | HL and DE point past the last processed byte |
| 8080 | HL (as H/L) | DE | BC, AF | Same as Z80 |
| 8086 | DS:SI (far) | ES:DI | AX, BX, CX, DX | SI and DI updated on return |
| 6502 | Zero‑page src (srcLo/srcHi) | Zero‑page dst (dstLo/dstHi) | A, X, Y | Pointers left in zero‑page variables |
The API was deliberately kept identical for the three Intel‑style CPUs, which made the translation from Z80 to 8080 to 8086 almost mechanical. The 6502 required a different calling convention, but the logical flow—read length byte, copy literals, read offset, copy back‑reference—remains unchanged.
What the exercise teaches us about low‑level design
- Let the hardware dictate the algorithmic granularity. The Z80’s LDIR encourages a “copy‑as‑large‑as‑possible” mindset, while the 6502 forces you to think in terms of byte‑by‑byte loops. When you respect those natural grain sizes, the code stays tight.
- Minimise state pressure. All four decoders keep only two long‑lived pointers. Anything else is stored temporarily on the stack (8080/8086) or in a few zero‑page bytes (6502). This reduces register spilling and keeps the inner loops short.
- Exploit sentinel values. Using a zero offset as the end‑of‑stream marker eliminates the need for a separate length field, a trick that works on any architecture that can test a 16‑bit word for zero in a single instruction (or two cheap byte tests).
- Reuse helper functions. The extra‑length reader (.rdlen) and the block‑copy routine (.ldir on the 6502) are identical in spirit across all CPUs. Factoring them out keeps the main loop readable and lets you benchmark the cost of the underlying copy primitive directly.
- Beware of 16‑bit loop bugs on 8‑bit machines. The 6502 copy loop is where a classic off‑by‑one error tends to creep in: decrementing a low byte without checking for a borrow. The same pattern appears in early ColecoVision BIOS code; the fix is to test the low byte, then conditionally decrement the high byte.
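The 16‑bit decrement pitfall is easiest to see side by side. In the Python sketch below, dec16 follows the fix described above (test the low byte first, then conditionally decrement the high byte), while dec16_buggy shows one plausible form of the mistake: decrementing first and then testing for zero. On the 6502 this is tempting because DEC only updates the zero and negative flags, not carry, so a borrow cannot be detected after the fact. Both function names are illustrative.

```python
def dec16(lo: int, hi: int) -> tuple[int, int]:
    """Correct 16-bit decrement of a counter held as two bytes."""
    if lo == 0:                 # low byte is about to borrow
        hi = (hi - 1) & 0xFF
    lo = (lo - 1) & 0xFF
    return lo, hi


def dec16_buggy(lo: int, hi: int) -> tuple[int, int]:
    """Faulty variant: tests for zero after the decrement."""
    lo = (lo - 1) & 0xFF
    if lo == 0:                 # fires one step too early
        hi = (hi - 1) & 0xFF
    return lo, hi
```

Starting from 0x0101, the correct routine yields 0x0100, whereas the faulty one jumps straight to 0x0000 and the loop ends 256 bytes short.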
Counter‑perspectives – why the other restrictions matter
The LZ4 specification also demands that the penultimate sequence’s back‑reference start at least 12 bytes before the block’s end, and that the final literal run be at least five bytes long (unless it is the only sequence). Those rules were introduced for decoders that copy several bytes unconditionally with a wide move and then back‑track if they overshot. In our byte‑wise decoders the benefit is marginal: the extra checks would add cycles without improving safety, because a byte‑by‑byte copy never writes past the requested length and the zero‑offset sentinel already marks the end of the stream. Nevertheless, the rules remind us that many classic libraries were written with a mixture of speed tricks and hardware quirks in mind; when porting to a new CPU it is worth re‑evaluating whether those tricks still pay off.
Conclusion – a toolbox for retro compression
By walking through the same LZ4 decompressor on four very different chips, we see a spectrum of trade‑offs:
- Z80 – shortest code, fastest copy thanks to LDIR.
- 8080 – almost identical logic, but more manual pointer handling.
- 8086 – leverages powerful string instructions and 16‑bit registers to achieve a compact, high‑throughput implementation.
- 6502 – forces a more explicit, memory‑centric style, but still feasible with a modest amount of extra RAM.
The practical upshot is that any 8‑ or 16‑bit platform capable of a few hundred bytes of RAM can now decompress LZ4 data without external libraries, opening the door to richer graphics, sound, and level data on hobbyist consoles. Moreover, the mental exercise of translating a single algorithm across such disparate instruction sets sharpens the programmer’s intuition about register pressure, addressing modes, and the value of hardware‑provided block operations.
If you enjoyed this deep dive, stay tuned for next week’s experiment with high‑level language ports of the same decompressor. Until then, happy hacking!
