The Enduring Elegance of bzip: Why This Forgotten Algorithm Still Wins at Compression


Tech Essays Reporter

A deep dive into why bzip and its Burrows-Wheeler Transform remain surprisingly effective for compressing code and text, outperforming modern algorithms in specific use cases despite being dismissed as obsolete.

When faced with the challenge of compressing code for a Minecraft mod with limited disk space, I discovered something surprising: bzip, an algorithm widely considered obsolete, still outperforms modern compression methods for text-like data. This isn't just a quirky finding—it reveals fundamental insights about compression algorithms and their appropriate use cases.

The Compression Landscape: Not All Algorithms Are Created Equal

The story begins with a simple question: what's the shortest, simplest, most ratio-efficient compression algorithm for compressing Lua code? Initially, this seemed like a complex question full of tradeoffs. But the answer turned out to be surprisingly clear-cut.

Testing various well-known encoders on a 327 KB file containing Lua code with occasional English text in comments yielded these results:

  • uncompressed: 327,005 bytes
  • gzip (zopfli --i100): 75,882 bytes
  • zstd -22 --long --ultra: 69,018 bytes
  • xz -9: 67,940 bytes
  • brotli -Z (recompiled without dictionary): 67,859 bytes
  • lzip -9: 67,651 bytes
  • bzip2 -9: 63,727 bytes
  • bzip3: 61,067 bytes

The bzip family emerged as a clear winner by a substantial margin, even beating lzip, whose documentation claims it compresses "most files" better than bzip2.

The Fundamental Difference: BWT vs LZ77

What makes bzip so effective? The answer lies in a fundamental architectural difference. While all other popular compression algorithms are based on LZ77—a scheme that replaces repetitive text with short links to earlier occurrences—bzip uses the Burrows-Wheeler Transform (BWT).

LZ77-based methods require complex heuristics to predict and encode token parameters like offsets and lengths, which vary wildly from location to location. A good algorithm needs sophisticated prediction to succinctly encode these parameters.
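To make those token streams concrete, here is a minimal greedy LZ77 sketch. The window size and minimum match length are illustrative choices, not gzip's actual parameters; real encoders spend most of their complexity deciding which match to take and how to entropy-code the resulting offsets and lengths.

```python
def lz77_tokens(data, window=4096, min_len=3):
    """Greedy LZ77 sketch: emit an (offset, length) token for the longest
    match found in the sliding window, otherwise a literal character.
    Matches may overlap the current position, as in real LZ77."""
    i, out = 0, []
    while i < len(data):
        best_off = best_len = 0
        for j in range(max(0, i - window), i):
            l = 0
            while i + l < len(data) and data[j + l] == data[i + l]:
                l += 1
            if l > best_len:
                best_off, best_len = i - j, l
        if best_len >= min_len:
            out.append((best_off, best_len))
            i += best_len
        else:
            out.append(data[i])
            i += 1
    return out

# "abcabcabc" becomes three literals plus one overlapping backreference:
print(lz77_tokens("abcabcabc"))  # ['a', 'b', 'c', (3, 6)]
```

Note how even this toy version must track offsets and lengths that vary from token to token, which is exactly the prediction burden the article describes.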

BWT takes a different approach: it reorders characters in the text to group them by context. Instead of predicting tokens based on similar earlier occurrences, you simply look at the last few symbols. Surprisingly, with BWT ordering, you don't even need to store where each symbol came from.

For example, when the word "hello" is repeated multiple times in text, LZ77 must find and insert new references at each occurrence. With BWT, all continuations of "hell" are grouped together, likely resulting in a sequence of many "o" characters in a row—something simple run-length encoding can handle efficiently.
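The grouping effect is easy to see with a naive rotation-sort BWT. This is a sketch for illustration; production implementations build a suffix array instead of materializing every rotation.

```python
def bwt(s):
    """Naive Burrows-Wheeler Transform: append a unique sentinel, sort
    all rotations of the string, and keep the last column.  Characters
    end up grouped by the context that follows them."""
    s += "\0"  # sentinel, sorts before every other character
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations)

# Every "o" in the input is preceded by "l", and all rotations starting
# with "o..." sort next to each other, so the last column contains a run
# of "l"s; likewise the rotations starting with "ello..." yield a run of
# "h"s.  Runs like these are what the later run-length stage exploits.
print(bwt("hello hello hello"))
```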

The BWT Tradeoff: Context Matters

BWT does come with downsides. If you concatenate two texts in different English dialects (like using "color" vs "colour"), BWT will mix the continuations of "colo" in an unpredictable order, forcing you to encode a weird sequence of "r" and "u" characters. LZ77 would prioritize recent history and handle this better.

However, for consistent data like code, BWT works remarkably well without any special preprocessing.

Simplicity and Predictability: The Hidden Advantage

Another fascinating aspect of bzip is its deterministic nature. LZ77-based methods offer multiple compression levels because searching for earlier occurrences is time-consuming; the BWT, by contrast, is entirely deterministic and free of heuristics.

There's no degree of freedom in determining how to efficiently encode lengths and offsets of backreferences (since there are none). All that's left are run lengths—a single number smaller than typical offsets. This means if you know what the bzip2 pipeline looks like, you can quickly achieve similar compression ratios without fine-tuning or worrying about edge cases.
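The post-BWT stages are equally simple. A move-to-front pass turns the BWT's long runs into streams of zeros, which run-length coding then collapses. This sketch shows the MTF half; bzip2's real pipeline adds run-length and Huffman stages on top.

```python
def mtf_encode(data, alphabet):
    """Move-to-front coding: each symbol is replaced by its position in
    a table, and that symbol is then moved to the table's front.  Long
    runs of one symbol (typical BWT output) encode as runs of zeros."""
    table = list(alphabet)
    out = []
    for c in data:
        i = table.index(c)
        out.append(i)
        table.insert(0, table.pop(i))
    return out

# A run-heavy, BWT-style input becomes mostly zeros:
print(mtf_encode("aaaabbbba", "ab"))  # [0, 0, 0, 0, 1, 0, 0, 0, 1]
```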

My unoptimized ad-hoc bzip2-like encoder compressed the same input to about 67 KB—better than lzip and with clear avenues for improvement.

Decoder Size: The Overlooked Metric

When targeting Lua or creating self-extracting archives, decoder size matters tremendously. Measuring compiled ELF binaries is useless in this context, and Lua libraries like LibDeflate don't optimize for code size in self-extracting archives.

By examining the core decompression loops, we can estimate decoder sizes:

  • gzip: ~1.5 KB (Huffman coding with bit trie)
  • xz: ~1 KB (bit-by-bit encoding, no tables)
  • lzip: ~1 KB (similar to xz)
  • zstd: ~3 KB (FSE with large tables)
  • brotli: ~2.2 KB (multiple Huffman tables)
  • bzip2: ~2.1 KB (MTF + RLE + Huffman with 6 tables)
  • bzip3: ~1.5 KB (XZ-like bit-by-bit coding with context mixing)

With a single Huffman table instead of the default six, my bzip-style decoder fits in 1.5 KB—smaller than everything but xz and lzip, while being faster and more compact.

Performance: Context Is Everything

The common wisdom that "bzip is slow" needs qualification. When compression is used to circumvent a hard limit, the difference between bzip and gzip isn't about booting slowly versus quickly—it's about booting versus not booting at all.

If unpacking initrd takes 0.5 seconds instead of 0.3 seconds, that's not a big deal compared to being unable to boot at all. And saying that bzip is slower than gzip glosses over an asymmetry: gzip cannot easily produce near-optimal output the way bzip can, so it searches for earlier occurrences heuristically, with a configurable time-ratio tradeoff.

On the decoding side, bzip is slow because inverting BWT requires random access. This is less of an issue in high-level languages like Lua, where all operations are slow anyway. Most typical compression techniques are even more expensive in this context.
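That random-access cost is visible in the inversion itself. Here is a sketch of the standard LF-mapping inversion for a sentinel-terminated string, with a throwaway naive forward transform to demonstrate the round trip:

```python
def ibwt(last):
    """Invert the BWT via the LF-mapping.  A[j] holds the last-column
    index of the j-th character of the sorted first column (the stable
    sort preserves occurrence order).  Following A walks the text
    forwards, one arbitrary jump into `last` per output character:
    the random access that makes BWT decoding cache-unfriendly."""
    A = sorted(range(len(last)), key=lambda i: (last[i], i))
    out, j = [], A[0]
    for _ in range(len(last)):
        out.append(last[j])
        j = A[j]
    return "".join(out)[1:]  # drop the leading sentinel

# Throwaway naive forward transform, just to show the round trip:
def bwt(s):
    s += "\0"
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

assert ibwt(bwt("hello hello hello")) == "hello hello hello"
```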

The Deeper Lesson: Structure vs. Heuristics

My attempts to create custom compression algorithms revealed a profound insight: real-world data is structured in ways that humans can't easily guess but computers can recognize on the fly. Attempts to compress code based on code structure struggle because they're only capable of compressing individual tokens.

This connects to a broader pattern in data compression: improvements come not from complicating algorithms, but from restructuring data and applying more powerful general-purpose methods. The 2009 article "Rationale for a Large Text Compression Benchmark" describes this connection to machine learning in detail.

Today, the top contender of the Large Text Compression Benchmark is nncp—and you can surely guess what "NN" stands for.

Conclusion: The b in bzip Stands for "Best"

bzip might be suboptimal as a general-purpose compression format, but it's great for text and code. Its encoders are less riddled with heuristics and word-of-mouth design decisions than LZ77-based encoders, leaving less room for subtle mistakes that greatly affect the compression ratio.

bzip decoding is quite fast when implemented in a high-level language. For specific use cases where you need maximum compression ratio with minimal decoder size, bzip remains surprisingly competitive—even against modern algorithms that have supposedly superseded it.

The b in bzip might as well stand for "best" when it comes to compressing code and text in constrained environments.
