PrismML launches 1‑bit and ternary Bonsai Image 4B models for on‑device diffusion generation

PrismML’s new Bonsai Image 4B family compresses the transformer of a 4‑billion‑parameter diffusion model to roughly 1 GB (0.93 GB with binary weights, 1.21 GB with ternary weights), enabling 512‑pixel image generation on iPhone 17 Pro Max and Mac M4 Pro while retaining 88‑95 % of the original FLUX.2 quality. Open weights and an iOS app are released alongside the models.

PrismML – Compact diffusion models for phones and laptops

PrismML announced two new variants of its Bonsai Image 4B model, a 4 billion‑parameter diffusion generator that can run locally on consumer hardware. The key innovation is a drastic reduction of the transformer’s weight size:

Variant	Weight representation	Effective bits/weight	Model size (transformer)	Total payload*	Mean active memory (512×512)
1‑bit Bonsai Image 4B	Binary {‑1,+1} + FP16 group‑wise scale	1.125	0.93 GB (8.3× smaller)	3.42 GB	1.5 GB
Ternary Bonsai Image 4B	Ternary {‑1,0,+1} + FP16 scale	1.71	1.21 GB (6.4× smaller)	3.88 GB	1.96 GB

*Payload includes the compressed text encoder and FP16 VAE; the text encoder is off‑loaded after prompt encoding, so runtime memory is lower than the total size.

How the compression works

The models start from the FLUX.2 Klein 4B architecture and keep its attention design and diffusion schedule unchanged. Only the large diffusion transformer is quantized:

Binary layers store each weight as a single bit and apply a per‑group FP16 scaling factor, achieving an effective 1.125 bits per weight.
Ternary layers add a zero state, giving a little more representational flexibility (1.71 bits per weight) and improving visual fidelity.
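
As a minimal sketch of the two weight formats described above, the NumPy code below quantizes a weight vector group by group and computes the effective bits per weight. The group size of 128, the mean‑absolute‑value scale, and the 0.7 ternary threshold are assumptions chosen for illustration; the article does not say how PrismML picks them. A group size of 128 is used because it reproduces the quoted figures: 1 + 16/128 = 1.125 bits for binary and log2(3) + 16/128 ≈ 1.71 bits for ternary, assuming the three states are packed ideally.

```python
import math
import numpy as np

GROUP_SIZE = 128  # assumption: the article does not state the group size


def quantize_binary(w, group_size=GROUP_SIZE):
    """Keep only the sign of each weight, plus one FP16 scale per group."""
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).mean(axis=1, keepdims=True).astype(np.float16)
    codes = np.where(groups >= 0, 1, -1).astype(np.int8)
    return codes, scale


def quantize_ternary(w, group_size=GROUP_SIZE, threshold=0.7):
    """Map small weights to 0 and keep the sign of the rest (illustrative heuristic)."""
    groups = w.reshape(-1, group_size)
    cutoff = threshold * np.abs(groups).mean(axis=1, keepdims=True)
    codes = (np.sign(groups) * (np.abs(groups) > cutoff)).astype(np.int8)
    # Scale = mean magnitude of the weights that were not zeroed out.
    kept = np.maximum(np.abs(codes).sum(axis=1, keepdims=True), 1)
    scale = (np.abs(groups * codes).sum(axis=1, keepdims=True) / kept).astype(np.float16)
    return codes, scale


def dequantize(codes, scale):
    """Reconstruct approximate FP32 weights from codes and per-group scales."""
    return codes.astype(np.float32) * scale.astype(np.float32)


def effective_bits(levels, group_size=GROUP_SIZE, scale_bits=16):
    """Bits per weight: ideal code length plus the amortized per-group scale."""
    return math.log2(levels) + scale_bits / group_size


rng = np.random.default_rng(0)
weights = rng.standard_normal(4096).astype(np.float32)
for name, quantize, levels in [("binary", quantize_binary, 2),
                               ("ternary", quantize_ternary, 3)]:
    codes, scale = quantize(weights)
    mse = float(np.mean((weights - dequantize(codes, scale).ravel()) ** 2))
    print(f"{name}: {effective_bits(levels):.3f} bits/weight, reconstruction MSE {mse:.3f}")
```

On random Gaussian weights the ternary reconstruction error comes out lower than the binary one, which matches the article's point that the extra zero state buys fidelity at the cost of more bits.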

A small fraction (~5 %) of the network – the projection layers that are especially sensitive to precision – stays in FP16. This hybrid approach preserves most of the model’s expressive power while shrinking the part that dominates memory and bandwidth during each denoising step.
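
A rough back‑of‑the‑envelope check shows how the hybrid layout relates to the reported size reductions. The sketch below assumes the baseline stores every weight in 16 bits and that exactly 5 % of the weights stay in FP16; both are simplifications, and the function names are illustrative only.

```python
FP16_BITS = 16


def average_bits(low_bits, fp16_fraction=0.05):
    """Average bits per weight when a fraction of the weights stays in FP16."""
    return (1 - fp16_fraction) * low_bits + fp16_fraction * FP16_BITS


def compression_ratio(low_bits, fp16_fraction=0.05):
    """Size reduction relative to an all-16-bit baseline (assumed)."""
    return FP16_BITS / average_bits(low_bits, fp16_fraction)


for name, bits, reported in [("1-bit", 1.125, 8.3), ("ternary", 1.71, 6.4)]:
    estimate = compression_ratio(bits)
    print(f"{name}: estimated {estimate:.1f}x, reported {reported}x")
```

The estimates land slightly above the reported 8.3× and 6.4×, which would be consistent with a somewhat larger FP16 share or with packing overhead, though the article does not give enough detail to say which.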

Performance on real devices

iPhone 17 Pro Max: 512×512 generation in ~9.4 s (binary) and ~8.2 s (ternary). The full‑precision FLUX.2 Klein 4B does not fit in the device’s memory at all.
Mac M4 Pro: 512×512 generation in ~6 s, up to 5.6× faster than the stock full‑precision MFLUX pipeline.
CUDA GPUs: The same quantized checkpoints run via PrismML’s Gemlite low‑bit GEMM kernels, giving comparable speedups on desktop GPUs.

Quality versus footprint

Benchmarking on three widely used suites shows the trade‑off clearly:

Model	GenEval (object composition)	HPSv3 (human preference)	DPG‑Bench (prompt fidelity)	Size reduction vs FLUX.2 Klein 4B
1‑bit Bonsai Image 4B	0.671	11.15	0.822	8.3× (88 % of FLUX.2 quality)
Ternary Bonsai Image 4B	0.723	12.22	0.851	6.4× (95 % of FLUX.2 quality)
FLUX.2 Klein 4B	0.819	12.84	0.853	1× (baseline)
Stable Diffusion 1.5	0.396	1.72	0.601	4.5× (51 % of FLUX.2 quality)

The ternary variant stays close to the original model’s scores, essentially matching it on DPG‑Bench (0.851 vs 0.853), while keeping active memory under 2 GB. The binary variant pushes the transformer below 1 GB at the cost of a larger drop, most visibly on GenEval (0.671 vs 0.819). Both models outscore older compact checkpoints such as Stable Diffusion 1.5 or BK‑SDM‑Small on the same memory budget.
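
To make the trade‑off concrete, the sketch below computes how much of the FLUX.2 Klein 4B score each Bonsai variant keeps on every benchmark, using only the numbers from the table. The unweighted mean is an assumption about how an aggregate could be formed; the article does not say how its "% of FLUX.2 quality" figures were derived.

```python
BASELINE = {"GenEval": 0.819, "HPSv3": 12.84, "DPG-Bench": 0.853}

MODELS = {
    "1-bit Bonsai Image 4B": {"GenEval": 0.671, "HPSv3": 11.15, "DPG-Bench": 0.822},
    "Ternary Bonsai Image 4B": {"GenEval": 0.723, "HPSv3": 12.22, "DPG-Bench": 0.851},
}


def retention(scores, baseline=BASELINE):
    """Per-benchmark score as a fraction of the baseline score."""
    return {name: scores[name] / baseline[name] for name in baseline}


def mean_retention(scores, baseline=BASELINE):
    """Unweighted mean of the per-benchmark fractions (assumed aggregate)."""
    ratios = retention(scores, baseline)
    return sum(ratios.values()) / len(ratios)


for model, scores in MODELS.items():
    per_benchmark = ", ".join(f"{k} {v:.1%}" for k, v in retention(scores).items())
    print(f"{model}: {per_benchmark} | mean {mean_retention(scores):.1%}")
```

With this simple mean the binary model comes out at roughly 88 % and the ternary model at roughly 94 %, close to the headline figures but not identical for ternary, so the published aggregate may use a different weighting.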

Why on‑device generation matters

Running a diffusion model locally eliminates three practical constraints of cloud‑only APIs:

Latency – every denoising step runs on the device, so no network round trip is added to the generation time.
Cost – no per‑image serving fees; users can iterate freely.
Privacy – prompts and generated assets never leave the device, an advantage for creative tools, offline apps, or regulated industries.

Because image creation is inherently iterative, producing a new version in a few seconds with no network delay on top can change the user experience from “submit‑wait‑receive” to a fluid, interactive canvas.

Availability and ecosystem

Both variants are released under the Apache 2.0 license with open weights. The code and model checkpoints are hosted on the project’s GitHub repository, and a lightweight iOS client – Bonsai Studio – lets anyone try the models on an iPhone. Additional resources include:

A whitepaper detailing the quantization pipeline (PDF link).
A Hugging Face model card for easy loading in Python.
A WebGPU demo that runs in the browser via the Hugging Face Spaces interface.
The MLX low‑bit inference library for Apple Silicon and the Gemlite kernels for CUDA.

“We’ve spent years tackling one of the field’s hardest problems: compressing neural networks without sacrificing their reasoning ability.” – PrismML co‑founder, referencing the team’s background at Caltech and backing from Khosla Ventures, Cerberus, and Google.

What to watch next

PrismML’s release suggests a broader shift toward edge‑first diffusion: as quantization techniques mature, more powerful generative models will become feasible on smartphones, AR glasses, and even micro‑controllers. The company’s next roadmap milestone is an 8 B‑parameter variant that aims to stay under 3 GB while matching the visual fidelity of current 7‑10 B models.

Developers interested in experimenting can start with the following links:

GitHub repository – https://github.com/prismml/bonsai-image-4b
Hugging Face model hub – https://huggingface.co/prismml/bonsai-image-4b
Bonsai Studio for iOS – https://apps.apple.com/app/bonsai-studio
MLX low‑bit library – https://github.com/ml-explore/mlx
Gemlite CUDA kernels – https://github.com/gemlite-ai/gemlite
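
For readers who want to pull the checkpoint files programmatically, a minimal sketch using the `huggingface_hub` package might look like the following. The repository ID is simply taken from the Hugging Face link above and has not been verified here; the helper name `download_checkpoint` and the default directory are illustrative, and the inference code needed to actually run the model is not covered.

```python
from pathlib import Path

# Repository ID copied from the Hugging Face link in this article (unverified).
REPO_ID = "prismml/bonsai-image-4b"


def download_checkpoint(local_dir="bonsai-image-4b"):
    """Download every file in the model repository and return the local folder."""
    # Imported lazily so this module loads even if huggingface_hub is not installed.
    from huggingface_hub import snapshot_download

    return Path(snapshot_download(repo_id=REPO_ID, local_dir=local_dir))
```

Calling `download_checkpoint()` requires network access and `pip install huggingface_hub`; given the total payload sizes quoted earlier, expect a download of several gigabytes.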

The open‑source nature of the release means the community can benchmark, fine‑tune, or integrate the models into existing pipelines, potentially accelerating the adoption of on‑device generative AI across creative, educational, and enterprise applications.
