Rebellions debuts Rebel100, a four-chiplet AI accelerator using UCIe interconnects that the company claims matches Nvidia H200 performance at lower power consumption.
Rebellions, a South Korean AI accelerator designer, has unveiled its Rebel100 AI accelerator at the International Solid-State Circuits Conference (ISSCC) 2026, marking a significant milestone in multi-chiplet design. The Rebel100 is one of the industry's first multi-chiplet designs to rely on UCIe-A interconnects, stitching together four chiplets into a unified system-in-package (SiP) that the company claims matches Nvidia's H200 performance at a lower power envelope.
Breaking New Ground in Multi-Chiplet Design
Multi-chiplet designs have become essential as the demand for AI and HPC performance continues to outpace traditional process technology scaling. Industry giants like AMD, Intel, and Nvidia have already embraced this methodology, but Rebellions' approach with the Rebel100 represents a particularly aggressive implementation of the Universal Chiplet Interconnect Express (UCIe) standard.
The Rebel100 SiP comprises four 320 mm² neural processing unit (NPU) dies, each paired with a 12-Hi 36 GB HBM3E memory stack, for a total of 144 GB of HBM3E per package. The dies are manufactured on Samsung's performance-enhanced SF4X process technology and packaged using Samsung's I-CubeS advanced packaging method with a silicon interposer.
UCIe-A Interconnect Architecture
What sets the Rebel100 apart is its use of UCIe-Advanced die-to-die interfaces running at 16 Gbps, providing an aggregate bandwidth of 4 TB/s. The interconnect achieves roughly 11 ns of Flit-Aware Die-to-Die Interface (FDI) to FDI latency, extending memory load-store semantics transparently across chiplets. This allows the SiP to behave as a single processor rather than a cluster of discrete dies.
Each chiplet integrates two Neural Core Clusters, with each cluster packing eight neural cores and 32 MB of shared memory. The shared memory is partitioned into 16 slices with an aggregate bandwidth of 64 TB/s. The chiplet contains 64 routers that form an 8×4 granular mesh topology with three logically separate channels: Data (D), Request (R), and Control (C).
System-Level Performance
On the system side, the Rebel100 connects to hosts via two PCIe Gen5 x16 interfaces that support SR-IOV and peer-to-peer operation. The company claims that one Rebel100 SiP can deliver 2 PFLOPS of FP8 or 1 PFLOPS of FP16 performance without sparsity at 600 W, which is in line with what Nvidia's H200 delivers at 700 W.
Rebellions also claims the unit can achieve 56.8 tokens per second on Llama 3.3 70B with single-batch 2K/2K input/output sequences. The company positions its Rebel100 quad-chiplet package as a foundational unit for cross-node and rack-level systems capable of supporting trillion-parameter models and million-token contexts.
Advanced Data Movement Engine
Rebellions built a fairly aggressive data-movement engine to keep its quad-chiplet design fed. Each NPU die integrates a configurable DMA subsystem with eight execution engines that can pull data from local HBM3E, remote HBM3E located on another chiplet, or from distributed shared memory. Bandwidth per DMA can reach up to 2.6 TB/s.
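As a rough mental model of such a DMA subsystem, transfers can be pictured as descriptors that name a source domain and are spread across the eight engines. The descriptor format, dispatch policy, and names below are illustrative assumptions, not Rebellions' actual programming interface:

```python
# Toy model of a multi-engine DMA subsystem: each transfer descriptor names
# one of the three source domains described in the article, and descriptors
# are dispatched round-robin across eight engines. Illustrative only.
from dataclasses import dataclass
from itertools import cycle

@dataclass
class Descriptor:
    source: str     # "local_hbm", "remote_hbm", or "shared_mem" (assumed labels)
    nbytes: int

NUM_ENGINES = 8     # per the article: eight execution engines per NPU die

def dispatch(descriptors):
    """Assign descriptors to engines round-robin; return bytes queued per engine."""
    load = [0] * NUM_ENGINES
    for engine, d in zip(cycle(range(NUM_ENGINES)), descriptors):
        load[engine] += d.nbytes
    return load

# Sixteen equal transfers balance evenly: 2 KB queued on each engine.
load = dispatch([Descriptor("local_hbm", 1024)] * 16)
```

In a real implementation the dispatcher would also account for source-domain contention (two engines hammering the same remote HBM stack help nobody), but the round-robin sketch captures the basic fan-out.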
The on-chip 2D network-on-chip (NoC) uses a straightforward XY routing scheme, where packets first travel along one axis and then the other, with turn restrictions applied to avoid deadlocks. Arbitration inside routers is handled using a weighted round-robin mechanism, allowing traffic from different sources to get serviced fairly while maintaining adjustable priority.
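The two mechanisms above can be sketched in a few lines of Python. This is a toy model, not the router RTL: `xy_route` performs dimension-ordered routing (X first, then Y; forbidding the Y-to-X turn is what rules out cyclic channel dependencies and hence deadlock), and `weighted_round_robin` drains input queues in proportion to assumed, configurable weights:

```python
# Dimension-ordered XY routing: travel along X to the destination column,
# then along Y. Coordinates are (x, y) router positions on the mesh.
def xy_route(src, dst):
    x, y = src
    dx, dy = dst
    path = []
    while x != dx:                      # X dimension first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then Y; never turn back to X
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

# Weighted round-robin arbitration: each pass services up to `weight`
# packets per queue, skipping empty queues so bandwidth is not wasted.
def weighted_round_robin(queues, weights):
    while any(queues):
        for q, w in zip(queues, weights):
            for _ in range(min(w, len(q))):
                yield q.pop(0)

hops = xy_route((0, 0), (3, 2))     # 5 hops: 3 along X, then 2 along Y

# Giving the (hypothetical) Data channel twice the weight of Request/Control:
order = list(weighted_round_robin([["d1", "d2", "d3"], ["r1"], ["c1"]], [2, 1, 1]))
```

The weights here are invented for illustration; the article only says priorities are adjustable.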
Synchronization and Reliability
Coordinating work across four chiplets requires careful synchronization. Rather than relying on a dedicated scheduler, Rebellions implemented synchronization managers in each NPU. Each chiplet integrates a dedicated hardware synchronization manager with hardwired control logic that can coordinate activity across dies, either under centralized control or in a more autonomous manner.
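A simple way to picture the centralized mode is a two-level barrier: cores report completion to their die's manager, and the global release waits on all four managers. The sketch below is a software analogy with assumed core counts, not the actual hardwired control logic:

```python
# Toy two-level barrier: one SyncManager per chiplet counts local core
# arrivals; the global barrier releases only when all managers are complete.
CHIPLETS = 4
CORES_PER_CHIPLET = 16      # assumed figure for illustration

class SyncManager:
    def __init__(self, cores):
        self.expected = cores
        self.arrived = 0

    def arrive(self):
        """A local core reports completion; True once the whole die is done."""
        self.arrived += 1
        return self.arrived == self.expected

def global_barrier(managers):
    """Centralized mode: release only when every chiplet reports complete."""
    return all(m.arrived == m.expected for m in managers)

managers = [SyncManager(CORES_PER_CHIPLET) for _ in range(CHIPLETS)]
for m in managers:
    for _ in range(CORES_PER_CHIPLET):
        m.arrive()
assert global_barrier(managers)
```

The autonomous mode the article mentions would instead have the four managers exchange completion state among themselves rather than report to a central point; the counting logic per die stays the same.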
The architecture specifically avoids direct peer-to-peer communications between units and inter-unit dependencies to cut down unnecessary traffic and coordination overhead, keeping overall utilization high during different execution phases of LLM inference.
For commercial deployments, Rebellions added a configurable switching mode that uses the aforementioned features to trade a small amount of performance for improved mean time between failures (MTBF) and mean time to failure (MTTF), which is crucial for large AI clusters where uptime matters more than marginal throughput gains.
Power Delivery Innovation
The Rebel100 is rated for a thermal design power of 600 W, but instantaneous transient surges, which occur when multiple neural cores switch on at once, can reach roughly twice that level. To mitigate this, Rebellions implemented a hardware staggering technique that offsets the start times of neural cores instead of activating them simultaneously, smoothing current ramps and reducing supply noise.
Measurements show that synchronized switching produces steep current spikes and noticeable voltage disturbance, whereas staggered activation results in gentler transitions and a more stable power rail. Additional control logic dynamically limits instruction issue rate over short time windows to further reduce sudden load changes both within a chiplet and across dies.
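The effect is easy to reproduce with a toy model: give each core a short inrush-current ramp and sum the ramps either aligned or offset. The ramp shape, core count, and stagger interval below are invented for illustration, not measured values:

```python
# Sum per-core activation ramps, with each core's start delayed by
# `stagger` time steps. Same total charge either way; only the peak moves.
def aggregate_current(num_cores, ramp, stagger):
    length = stagger * (num_cores - 1) + len(ramp)
    total = [0.0] * length
    for core in range(num_cores):
        start = core * stagger
        for i, amps in enumerate(ramp):
            total[start + i] += amps
    return total

ramp = [1.0, 2.0, 4.0, 2.0, 1.0]        # one core's inrush profile (A), assumed
sync = aggregate_current(16, ramp, 0)   # all 16 cores switch together
stag = aggregate_current(16, ramp, 2)   # starts offset by 2 time steps

print(max(sync))    # 64.0 -- sixteen 4 A peaks stacked exactly in phase
print(max(stag))    # 6.0  -- at most three offset ramps overlap at any step
```

The roughly tenfold peak reduction in this toy model overstates what any real chip sees, but it shows why staggering flattens the current profile without changing the energy delivered.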
Memory traffic adds another layer of stress. HBM3E bursts can be just as demanding as compute surges, putting extra strain on the power delivery network. To reinforce it, Rebellions added dedicated integrated silicon capacitor (ISC) dies that embed distributed capacitance across the VDD rails to serve both the NPU and the HBM3E PHY. This approach further dampens voltage oscillations and lowers impedance peaks compared to a design without ISC dies.
Market Implications
The Rebel100 is a solid example of a multi-chiplet design that relies on an industry-standard interconnect while still using proprietary techniques to maximize performance and optimize power consumption. By achieving performance similar to Nvidia's H200 at a lower power envelope, Rebellions has demonstrated that smaller, more manageable dies can compete with monolithic designs while offering advantages in yield and development complexity.
While it's unclear whether Rebellions plans to build bigger SiPs using existing chiplets, the company certainly envisions its partners building scale-up and scale-out clusters containing from dozens to tens of thousands of such AI accelerators. This positions the Rebel100 as a foundational building block for the next generation of AI infrastructure, particularly for inference workloads requiring trillion-parameter models and million-token contexts.
The success of the Rebel100 could accelerate adoption of UCIe technology across the industry, as it demonstrates the practical benefits of standardized multi-chiplet interconnects while showcasing how proprietary optimizations can still provide competitive advantages in specific workloads.
