ts_zip: Pushing Text Compression Boundaries with Language Model Priors
#Machine Learning


Trends Reporter

A new experimental utility from Fabrice Bellard demonstrates how small language models can achieve text compression ratios well beyond conventional tools by exploiting learned statistical patterns, though at significant computational cost.

Fabrice Bellard, known for foundational contributions like FFmpeg and QEMU, has released ts_zip, an experimental text compression utility that replaces the hand-engineered modeling stage of traditional compressors with a 169M-parameter language model feeding an arithmetic coder. The results are striking: on standard text benchmarks, it achieves compression ratios that substantially outperform conventional tools like xz, but the trade-offs reveal why this remains a research curiosity rather than a practical replacement.

The Compression Strategy

Traditional compressors like gzip, bzip2, or xz rely on identifying repeated patterns in the input data. They build dictionaries of substrings or apply transforms like the Burrows-Wheeler transform to cluster similar characters together, then use entropy coders like Huffman or arithmetic coding to encode the transformed data efficiently.
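
To make that concrete, here is a small self-contained experiment (standard-library Python only, with zlib and lzma standing in for the gzip and xz tools) showing how strongly these compressors depend on literal repetition, measured in bits per byte, the same metric used in the benchmarks below:

```python
# A tiny illustration of why repetition matters to dictionary-based compressors:
# highly repetitive input collapses dramatically, while random bytes barely shrink.
import os, zlib, lzma

repetitive = b"the quick brown fox jumps over the lazy dog. " * 1000
random_ish = os.urandom(len(repetitive))

for name, data in [("repetitive text", repetitive), ("random bytes", random_ish)]:
    for codec, fn in [("gzip/DEFLATE", zlib.compress), ("xz/LZMA", lzma.compress)]:
        bpb = 8 * len(fn(data)) / len(data)   # compressed bits per original byte
        print(f"{name:16s} {codec:12s} {bpb:.3f} bits/byte")
```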

ts_zip takes a fundamentally different approach. It uses the RWKV 169M v4 language model—a relatively small recurrent neural network trained primarily on English text. During compression, the model predicts the probability distribution for the next token (word or subword unit) given the preceding context. An arithmetic coder then encodes the actual next token using these probabilities. The better the model's predictions, the fewer bits needed to encode each token.
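
The trick that makes this decodable is symmetry: the decompressor runs the same model over the tokens it has already reconstructed, derives the identical probability distribution, and can therefore undo the arithmetic coding without any probabilities being stored in the file. Below is a minimal sketch of that loop, with a hand-written toy predictor standing in for the RWKV network (the toy_model function and its made-up probabilities are purely illustrative); instead of emitting an actual bitstream, it tallies the -log2(p) cost an ideal arithmetic coder would pay per token:

```python
# A toy sketch of the idea (not ts_zip's actual code): a predictive model supplies
# P(next token | context), and an ideal arithmetic coder spends about -log2(p) bits
# on the token that actually occurs.
import math

def toy_model(context: list[str]) -> dict[str, float]:
    """Stand-in predictor: returns a probability for each candidate next token."""
    if context and context[-1] == "the":
        return {"cat": 0.4, "dog": 0.4, "quantum": 0.2}
    return {"the": 0.6, "a": 0.3, "quantum": 0.1}

def ideal_compressed_bits(tokens: list[str]) -> float:
    """Total bits an ideal arithmetic coder would spend, given the model."""
    total, context = 0.0, []
    for tok in tokens:
        p = toy_model(context)[tok]  # model's probability for the token that occurs
        total += -math.log2(p)       # arithmetic-coding cost of encoding that token
        context.append(tok)
    # The decoder reruns the same model on the tokens it has already decoded,
    # so it reproduces these distributions exactly and nothing extra is stored.
    return total

print(ideal_compressed_bits(["the", "cat"]))      # ~2.06 bits: well predicted
print(ideal_compressed_bits(["the", "quantum"]))  # ~3.06 bits: surprising token
```

With these toy numbers, a well-predicted token costs barely more than a bit while a surprising one costs over two; a model that routinely assigns high probability to real English text is what turns that effect into the benchmark numbers below.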

This is essentially using the model's learned prior knowledge about language structure as a form of compression. The model "knows" that certain words are likely to follow others, that English sentences follow grammatical patterns, and that common words need fewer bits to encode when predicted accurately.

Performance Reality Check

The compression ratios are genuinely impressive:

  • alice29.txt (152KB): xz achieves 2.551 bits/byte, ts_zip reaches 1.142 bits/byte
  • book1 (768KB): xz at 2.717 bpb vs ts_zip at 1.431 bpb
  • enwik8 (100MB): xz at 1.989 bpb vs ts_zip at 1.106 bpb
  • enwik9 (1GB): xz at 1.707 bpb vs ts_zip at 1.084 bpb
  • linux-1.2.13.tar (9.3MB): xz at 1.441 bpb vs ts_zip at 1.021 bpb

These numbers represent roughly a 29-55% reduction in compressed size relative to xz, smallest on the Linux source tarball and largest on alice29.txt. On the Large Text Compression Benchmark, ts_zip would rank among the top performers for text compression.
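
Those reductions follow directly from the bits-per-byte figures in the list; a few lines of Python make the arithmetic explicit (the numbers are copied verbatim from above):

```python
# Size reduction implied by the bits-per-byte figures quoted above.
pairs = {            # corpus: (xz bpb, ts_zip bpb)
    "alice29.txt": (2.551, 1.142),
    "book1": (2.717, 1.431),
    "enwik8": (1.989, 1.106),
    "enwik9": (1.707, 1.084),
    "linux-1.2.13.tar": (1.441, 1.021),
}
for name, (xz, ts) in pairs.items():
    print(f"{name}: {100 * (1 - ts / xz):.1f}% smaller than xz")
# Prints reductions from about 29% (the tarball) up to about 55% (alice29.txt).
```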

However, the speed tells a different story. Even on an RTX 4090 GPU, compression and decompression reach only about 1 MB/s, orders of magnitude slower than conventional compressors, which can process hundreds of MB/s. The requirement for roughly 4 GB of RAM and a GPU makes it impractical for most real-world scenarios.
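
To put that in perspective: at 1 MB/s, the 1 GB enwik9 file takes roughly a quarter of an hour to compress and as long again to decompress, while a tool running at a few hundred MB/s finishes either direction in a handful of seconds.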

Technical Implementation Details

The model weights are quantized to 8 bits per parameter, and inference runs in BF16 floating point, which reduces the memory footprint while maintaining reasonable accuracy. Both the model evaluation and the arithmetic coder are implemented to be deterministic and reproducible across different hardware configurations, a critical requirement: the decompressor must recreate exactly the same probability distributions to undo the encoding.
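
ts_zip's exact quantization scheme isn't spelled out here, but a common way to reach 8 bits per parameter is per-tensor symmetric scaling. The following is a purely illustrative sketch (NumPy, not ts_zip's actual code) of storing weights as int8 plus one shared scale and dequantizing for inference:

```python
# Illustrative per-tensor 8-bit weight quantization (not ts_zip's actual scheme).
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Map float weights to int8 with a single symmetric scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights for use during inference."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)   # a stand-in weight matrix
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
# Storage drops from 4 bytes to 1 byte per parameter, at a small accuracy cost.
```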

Bellard notes that the model was trained "mostly on English texts" but handles other languages and source code reasonably well. This makes sense: source code has strong syntactic patterns that language models learn, and the model's tokenization captures common programming constructs.

The Caveats and Limitations

Several factors limit practical adoption:

Speed: At 1 MB/s on high-end hardware, this is viable only for archival scenarios where compression time is irrelevant. Decompression at the same speed means extracting large files takes prohibitively long.

Hardware Requirements: The GPU dependency excludes most server environments, embedded systems, and mobile devices. CPU fallback would likely be even slower.

Model Specificity: The RWKV model was trained on English text. While it handles other content, its compression advantage diminishes on non-text data or languages outside its training distribution. Binary files compress poorly.

Version Instability: As experimental software, there's no guarantee of backward compatibility between versions. This makes it unsuitable for long-term archival where decompression must work decades later.

Memory Footprint: The 4GB RAM requirement stems from loading the model and maintaining inference state. This is substantial for a compression utility.

Why This Matters

Despite the limitations, ts_zip represents an important exploration of how machine learning priors can enhance compression. Traditional compressors must discover whatever structure they exploit from the input itself, using generic, hand-designed statistics; a language model arrives with rich prior knowledge about how text is put together, learned before compression even begins.

The approach mirrors how human compression works: we can often predict what someone will say next based on context, requiring fewer mental "bits" to process the information. The model encodes this predictive capability mathematically.

This also demonstrates that smaller language models (169M parameters is tiny by modern standards) have practical utility beyond text generation. They can serve as specialized tools for specific domains.

Community Response and Counterpoints

The compression community has shown interest but also skepticism. The ratios are compelling, but the practical barriers are significant. Proponents point out that:

  • For archival purposes, speed often matters less than size, making this viable for cold storage
  • The model could potentially be distilled or optimized for specific domains
  • Future hardware acceleration might improve speed
  • Hybrid approaches combining traditional methods with model-based compression could balance speed and ratio

However, defenders of traditional methods argue that decades of optimization have made tools like zstd and lz4 nearly as fast as memory bandwidth, with respectable ratios. The marginal gains from ts_zip rarely justify the complexity and resource costs.

The Broader Pattern

ts_zip fits into a larger trend of applying learned models to compression problems. Similar approaches have appeared in image compression (using neural networks), audio compression, and even video. The fundamental insight is that machine learning models can capture statistical regularities that hand-engineered algorithms miss.

Bellard's implementation is notable for its simplicity and transparency. The entire process is deterministic and reproducible, avoiding the black-box concerns that plague some ML applications. The choice of a relatively small, well-understood model architecture (RWKV) keeps the system comprehensible.

Practical Takeaways

For developers, ts_zip serves as a proof-of-concept rather than a drop-in tool. It shows what's possible when compression algorithms incorporate learned priors. The performance gap suggests that future work might focus on:

  • Faster model architectures optimized for inference speed
  • Domain-specific models trained on particular data types
  • Hybrid systems that use models only where they provide clear advantage
  • Hardware acceleration specifically designed for compression workloads

The utility also highlights an important trade-off in modern systems: we're increasingly willing to trade computational resources for better results, whether in compression, rendering, or inference. ts_zip pushes this trade-off to an extreme.

For archival and research scenarios where every byte counts and time is abundant, this represents a viable new tool. For everyday use, conventional compressors remain the pragmatic choice.

Bellard's work continues his pattern of exploring fundamental algorithms and reimplementing them with modern insights. Like his previous projects, ts_zip is both a practical tool and a statement about how computational techniques evolve—incorporating new paradigms while maintaining rigorous engineering standards.

The code and binaries are available for Linux and Windows, inviting experimentation and further development. Whether it remains a research curiosity or evolves into a practical tool depends on whether the speed limitations can be overcome through optimization, hardware support, or algorithmic breakthroughs.
