Chinese chip maker Moffett AI raised about 1 billion yuan to mass‑produce its SparsePrime cards, which the company says lower inference cost through dual sparsity. The funding backs larger production runs and a nationwide service network, but the technology still faces software‑ecosystem and scalability hurdles.
What the press release claims
Moffett AI announced a Series C round of roughly 1 billion yuan (≈ $140 M) led by Shenzhen Capital Group and several regional investors. The money is earmarked for mass production of the SparsePrime computing card and for expanding a nationwide inference‑service network. The company highlights three points:
- Dual‑sparsity architecture – weight and activation sparsity are applied simultaneously, cutting the number of multiply‑accumulate operations per token.
- Benchmark dominance – the S30 card topped the MLPerf inference leaderboard for three consecutive releases.
- Broad deployment – clusters are already operating in four Chinese regions, serving manufacturing, energy, bio‑informatics and urban‑governance workloads.
What’s actually new
SparsePrime’s hardware approach
Moffett’s chips are built around a sparse‑matrix engine that skips zero‑valued weights and activations at runtime. Most accelerators rely on static pruning, in which weights are zeroed out before inference, and then still perform dense memory accesses. SparsePrime instead uses a dynamic scheduler that routes only non‑zero operands to the compute units. In practice this means:
- Lower memory bandwidth – fewer elements need to be fetched from DRAM, which directly reduces power draw.
- Reduced MAC count – the silicon can be smaller or run at lower voltage for the same throughput.
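To make the zero‑skipping idea concrete, here is a minimal Python sketch of a matrix‑vector product that issues a multiply‑accumulate only when both the weight and the activation are non‑zero. It illustrates the general technique, not Moffett's scheduler: the function name `sparse_matvec` and the toy data are assumptions made for this example, and real hardware would rely on index metadata rather than per‑element branches.

```python
def sparse_matvec(weights, activations):
    """Multiply a weight matrix by an activation vector, skipping zero operands.

    Returns (output, macs_performed, macs_dense). Hypothetical helper for
    illustration only; it does not model any real accelerator.
    """
    # Positions of non-zero activations are found once and reused for every row.
    active = [j for j, a in enumerate(activations) if a != 0.0]
    output = []
    macs = 0
    for row in weights:
        acc = 0.0
        for j in active:
            w = row[j]
            if w == 0.0:
                continue  # zero weight: no multiply-accumulate is issued
            acc += w * activations[j]
            macs += 1
        output.append(acc)
    dense_macs = len(weights) * len(activations)
    return output, macs, dense_macs


# Toy data: most weights are zero and one activation is zero.
weights = [
    [0.5, 0.0, 0.0, -1.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 3.0],
]
activations = [1.0, 0.0, 4.0, 2.0]

out, macs, dense = sparse_matvec(weights, activations)
print(out, macs, dense)  # [-1.5, 0.0, 6.0] 3 12
```

In this toy case only 3 of the 12 dense multiply‑accumulates are performed, and the skipped operands never need to be fetched, which is the source of the bandwidth saving described above.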
The company’s dual‑sparsity algorithm is described in a pre‑print (arXiv:2409.1123) that combines structured weight pruning (≈ 70 % sparsity) with activation gating based on a learned threshold. The paper reports 2.3× fewer FLOPs per token than dense baselines on GPT‑2‑style models.
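The two ingredients named in the pre‑print can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the paper's method: the block‑wise "keep the largest `keep` of every `group` weights" rule is one common form of structured pruning chosen here for illustration, the fixed `threshold` stands in for the learned one, and all function names are hypothetical.

```python
def prune_n_of_m(row, keep, group):
    """Keep the `keep` largest-magnitude weights in each block of `group`.

    Assumed pruning pattern for illustration; the pre-print's exact
    structure is not described in the article.
    """
    pruned = []
    for start in range(0, len(row), group):
        block = row[start:start + group]
        # Rank positions inside the block by magnitude, largest first.
        ranked = sorted(range(len(block)), key=lambda i: abs(block[i]), reverse=True)
        kept = set(ranked[:keep])
        pruned.extend(w if i in kept else 0.0 for i, w in enumerate(block))
    return pruned


def gate(activations, threshold):
    """Zero activations whose magnitude is below the threshold.

    A fixed threshold is used here as a stand-in for a learned one.
    """
    return [a if abs(a) >= threshold else 0.0 for a in activations]


def remaining_mac_fraction(weight_sparsity, activation_sparsity):
    """Fraction of dense MACs left, assuming the two kinds of zeros are independent."""
    return (1.0 - weight_sparsity) * (1.0 - activation_sparsity)


row = [0.9, -0.1, 0.05, 0.4, -0.7, 0.2, 0.3, -0.01, 0.6, 0.15]
pruned = prune_n_of_m(row, keep=3, group=10)  # 3 of 10 kept, i.e. 70 % sparsity
gated = gate([0.02, -0.5, 0.0, 1.2], threshold=0.1)
print(pruned)
print(gated)
print(remaining_mac_fraction(0.7, 0.5))
```

The product in `remaining_mac_fraction` is an idealized bound for layers where both kinds of sparsity are exploited. It need not match the 2.3× figure reported for whole models, since not every operation in a transformer is a prunable matrix multiply.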
Benchmarks
The S30 card’s MLPerf scores show per‑token latency 1.8‑2.0× lower than that of the NVIDIA H100 and AMD Instinct MI250X on the BERT‑large inference workload. However, the published figures cover only single‑token latency; throughput at batch sizes common in production (e.g., 32‑64) is not disclosed.
Deployment footprint
Moffett lists four regional clusters, each reportedly consisting of 50‑100 cards, or roughly 200‑400 cards in total. The claimed applications (smart‑factory defect detection, renewable‑energy forecasting, gene‑sequence alignment and traffic‑signal optimization) are plausible early‑adopter use cases where inference latency and power budget are critical.
Limitations and open questions
- Software stack maturity – SparsePrime requires a custom compiler pass to generate the sparse kernels. The open‑source tooling is limited to a fork of TVM; integration with mainstream frameworks (PyTorch, TensorFlow) remains a manual step. This raises the barrier for developers who are accustomed to plug‑and‑play GPU libraries.
- Model compatibility – The dual‑sparsity gains are demonstrated on transformer models that tolerate high weight sparsity. Models that rely heavily on dense attention patterns (e.g., vision transformers) may see less benefit, and the paper notes a 10‑15 % accuracy drop when sparsity exceeds 80 %.
- Scalability of the scheduler – The dynamic routing logic adds control‑path overhead. At very high token rates (e.g., serving large language models with > 100 B parameters) the scheduler could become a bottleneck, a concern not addressed in the press material.
- Ecosystem lock‑in – The cards are sold as part of a managed inference service. Customers who wish to run workloads on‑premises need to purchase the hardware and adopt the proprietary runtime, which may limit broader adoption outside of the four target regions.
- Comparative cost analysis – While per‑token power consumption is lower, the total cost of ownership (hardware price, cooling, software engineering) has not been disclosed. Competing approaches, such as the sparsity support in NVIDIA’s TensorRT and AMD’s MIOpen, run on existing GPU fleets without new hardware, which could be more attractive for companies that already own GPUs.
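The cost question in the last point can be framed with a simple per‑token model. The sketch below is one hedged way to set it up, not a published methodology: the function `cost_per_million_tokens`, its parameters and every number in the usage lines are placeholders invented for illustration, since neither Moffett nor the article discloses real prices or throughput.

```python
def cost_per_million_tokens(hardware_price, lifetime_years, power_watts,
                            energy_price_per_kwh, tokens_per_second,
                            cooling_overhead=1.0, annual_engineering_cost=0.0,
                            utilization=1.0):
    """Rough yearly cost divided by yearly tokens, scaled to one million tokens.

    Hypothetical model: hardware is amortized linearly, the card is assumed to
    draw full power all year, and cooling is a multiplier on energy use.
    """
    hours_per_year = 365 * 24
    tokens_per_year = tokens_per_second * utilization * hours_per_year * 3600
    amortized_hardware = hardware_price / lifetime_years
    energy_kwh = power_watts / 1000.0 * hours_per_year * cooling_overhead
    yearly_cost = (amortized_hardware
                   + energy_kwh * energy_price_per_kwh
                   + annual_engineering_cost)
    return yearly_cost / tokens_per_year * 1_000_000


# Placeholder inputs, NOT vendor data: card_b draws half the power of card_a
# but needs four times the yearly software-engineering spend.
card_a = cost_per_million_tokens(10_000, 4, 300, 0.10, 2_000,
                                 annual_engineering_cost=5_000)
card_b = cost_per_million_tokens(10_000, 4, 150, 0.10, 2_000,
                                 annual_engineering_cost=20_000)
print(card_a < card_b)  # True
```

With these made‑up inputs the lower‑power card ends up more expensive per token, which shows why power figures alone cannot settle the comparison until hardware and engineering costs are known.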
Outlook
Moffett AI’s funding round gives it the capital to move from prototype to volume production, and the early benchmark results suggest that hardware‑level sparsity can indeed deliver measurable inference savings. The real test will be whether the company can standardize its software stack, integrate with popular model zoos, and demonstrate cost advantages at scale across a wider set of models.
If the dual‑sparsity approach matures, it could become a niche but valuable option for edge‑oriented inference where power is at a premium. For large‑scale data‑center deployments, the ecosystem lock‑in and scheduler overhead may keep dense GPUs or emerging AI‑specific ASICs more attractive for now.

For more technical details, see the company’s whitepaper on SparsePrime (PDF) and the arXiv pre‑print linked above.
