A deep‑dive into Cheng Huang’s three‑month effort to rebuild Azure’s Replicated State Library in Rust using AI coding agents, covering the concrete productivity gains, the novel contract‑driven testing workflow, lightweight spec‑driven development, and the limits that still require human oversight.
What a 100 K‑line Rust project really tells us about AI‑assisted development
Claim: AI can replace a team of engineers for a production‑grade consensus system
Cheng Huang reports that, with a handful of AI coding agents (Claude Code, Codex CLI, GitHub Copilot, etc.), he wrote 130 K lines of Rust implementing a full multi‑Paxos engine in roughly six weeks. The codebase includes leader election, log replication, snapshotting, configuration changes, and a test suite of more than 1 300 cases. Performance was tuned from 23 K ops / s to **300 K ops / s** on a laptop, and the resulting system supports pipelining and non‑volatile memory (NVM) – two major gaps in Azure’s original Replicated State Library (RSL) [1].
What’s actually new
1. A modern Rust re‑implementation of RSL
- Pipelined voting – requests no longer block while a previous vote is in flight, cutting latency dramatically.
- NVM‑aware log – the persistence layer uses the verified PoWER‑Never‑Corrupts design [5] to write directly to persistent memory, reducing commit latency on Azure‑grade hardware.
- Hardware‑aware abstractions – the codebase is structured to allow RDMA integration later, even though that part is still pending.
The core protocol logic follows the multi‑Paxos design described by Jay Lorch [4] and matches the feature set of the original RSL, but the implementation is idiomatic Rust, leveraging ownership, zero‑cost abstractions, and async‑free paths where possible.
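To make the pipelining idea concrete, here is a minimal sketch of leader-side bookkeeping, not the project's actual code. The `Pipeline` type, its fields, and the fixed in-flight window are all assumptions introduced for illustration. It shows slots being assigned without waiting for earlier votes, while the commit point advances only over a contiguous prefix of decided slots.

```rust
use std::collections::BTreeMap;

/// Hypothetical leader-side state for pipelined phase-2 voting.
struct Pipeline {
    cluster_size: usize,
    window: usize,              // maximum number of slots in flight
    next_slot: u64,             // next log slot to assign
    committed: u64,             // every slot below this index is decided
    acks: BTreeMap<u64, usize>, // in-flight slot -> votes received so far
}

impl Pipeline {
    fn new(cluster_size: usize, window: usize) -> Self {
        Pipeline { cluster_size, window, next_slot: 0, committed: 0, acks: BTreeMap::new() }
    }

    fn majority(&self) -> usize {
        self.cluster_size / 2 + 1
    }

    /// Assign a slot without waiting for earlier votes.
    /// Returns None when the in-flight window is full.
    fn propose(&mut self) -> Option<u64> {
        if self.acks.len() >= self.window {
            return None;
        }
        let slot = self.next_slot;
        self.next_slot += 1;
        self.acks.insert(slot, 1); // the leader counts its own vote
        Some(slot)
    }

    /// Record one acceptor vote (simplification: duplicates are not filtered),
    /// then advance the commit point over the contiguous decided prefix.
    fn on_ack(&mut self, slot: u64) -> u64 {
        if let Some(count) = self.acks.get_mut(&slot) {
            *count += 1;
        }
        let majority = self.majority();
        while let Some(&count) = self.acks.get(&self.committed) {
            if count < majority {
                break;
            }
            self.acks.remove(&self.committed);
            self.committed += 1;
        }
        self.committed
    }
}
```

In a three-node cluster, an ack for slot 1 that arrives before the ack for slot 0 is recorded but does not move the commit point. Once slot 0 reaches a majority, both slots commit together.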
2. Contract‑driven development powered by LLMs
- Contracts as first‑class artifacts – pre‑conditions, post‑conditions, and invariants are written in a DSL that the AI expands into runtime `assert!` checks for testing and optional `debug_assert!` checks for production builds.
- Test generation from contracts – once a contract exists, a prompt to Claude or Codex produces targeted unit tests covering edge cases.
- Property‑based testing – the AI translates contracts into `proptest`‑style generators, automatically exploring large input spaces. One such generated test caught a subtle safety violation in the phase‑2a handler before any manual test could expose it.
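The contract pattern described above might look roughly like the following sketch. The `Acceptor` type and its contract are hypothetical and are not taken from the project. To keep the example dependency-free, a tiny deterministic PRNG stands in for `proptest`, which in a real project would also supply generators and shrinking.

```rust
/// Hypothetical acceptor state for a single log slot.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Acceptor {
    promised: u64,                // highest ballot promised so far
    accepted: Option<(u64, u64)>, // (ballot, value) last accepted, if any
}

impl Acceptor {
    /// Invariant: an accepted ballot never exceeds the promised ballot.
    fn invariant(&self) -> bool {
        self.accepted.map_or(true, |(ballot, _)| ballot <= self.promised)
    }

    /// Phase-2a handler with its contract written out as assertions.
    fn on_accept(&mut self, ballot: u64, value: u64) -> bool {
        // Pre-condition: the state is consistent on entry.
        assert!(self.invariant(), "invariant violated on entry");
        let old = *self;

        let ok = ballot >= self.promised;
        if ok {
            self.promised = ballot;
            self.accepted = Some((ballot, value));
        }

        // Post-conditions. debug_assert! is compiled out of release builds
        // unless debug assertions are explicitly enabled.
        debug_assert!(self.promised >= old.promised, "promise went backwards");
        debug_assert!(ok || *self == old, "rejected request mutated state");
        assert!(self.invariant(), "invariant violated on exit");
        ok
    }
}

/// Small xorshift PRNG used in place of a property-testing crate.
fn xorshift(state: &mut u64) -> u64 {
    let mut x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    x
}

/// Property: for any message sequence, promises never decrease and an
/// accepted request is recorded exactly as sent.
fn check_monotone_promises(seed: u64, steps: usize) -> bool {
    let mut rng = seed.max(1); // xorshift must not start from zero
    let mut acc = Acceptor { promised: 0, accepted: None };
    for _ in 0..steps {
        let before = acc.promised;
        let ballot = xorshift(&mut rng) % 16;
        let value = xorshift(&mut rng);
        let ok = acc.on_accept(ballot, value);
        if acc.promised < before {
            return false;
        }
        if ok && acc.accepted != Some((ballot, value)) {
            return false;
        }
    }
    true
}
```

Calling `check_monotone_promises` over a range of seeds exercises many interleavings of stale and fresh ballots. A contract violation surfaces as a panic from the assertions, or as a `false` return from the property check.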
3. Lightweight spec‑driven workflow
Instead of a heavyweight document chain (requirement → design → task list), Huang uses the spec‑kit toolset:
- `/specify` creates a short markdown spec with user stories and acceptance criteria.
- `/clarify` asks the model to critique and extend the spec, surfacing missing scenarios.
- `/plan` generates a concrete step‑by‑step plan for a single story.
The process treats a single user story as the atomic unit of AI work, which keeps the context size manageable and allows on‑the‑fly adjustments without breaking document consistency.
4. AI‑in‑the‑loop performance tuning
The optimization loop is essentially:
- Prompt AI to instrument latency points.
- Run a benchmark, collect traces.
- Let the model write a Python script to compute quantiles and highlight hot paths.
- Ask for concrete code changes (e.g., replace `Arc<Mutex<>>` with lock‑free `AtomicUsize`, eliminate redundant `clone()` calls, batch allocations).
- Re‑measure.
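The trace-analysis step of this loop can be sketched as follows. The post describes the model writing a Python script for it. The same computation is shown here in Rust only to keep a single language across the examples. The trace shape (stage name plus latency in microseconds) and both function names are assumptions.

```rust
use std::collections::HashMap;

/// Nearest-rank quantile over an ascending slice of latencies (microseconds).
/// Returns None for an empty slice or a `q` outside [0, 1].
fn quantile(sorted: &[u64], q: f64) -> Option<u64> {
    if sorted.is_empty() || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = (q * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.max(1) - 1])
}

/// Group samples by stage and list stages by p99 latency, slowest first.
fn hottest_stages(trace: &[(&str, u64)]) -> Vec<(String, u64)> {
    let mut by_stage: HashMap<&str, Vec<u64>> = HashMap::new();
    for &(stage, micros) in trace {
        by_stage.entry(stage).or_default().push(micros);
    }
    let mut out: Vec<(String, u64)> = by_stage
        .into_iter()
        .filter_map(|(stage, mut samples)| {
            samples.sort_unstable();
            quantile(&samples, 0.99).map(|p99| (stage.to_string(), p99))
        })
        .collect();
    // Sort by p99 descending; break ties by stage name for stable output.
    out.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    out
}
```

Ranking by a tail quantile rather than the mean keeps attention on the slow outliers that tend to dominate commit latency.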
Repeating this cycle over roughly three weeks raised throughput by ~13×, from 23 K to 300 K ops / s. The biggest wins came from:
- Reducing async task spawns on the critical path.
- Consolidating allocation‑heavy buffers into a reusable arena.
- Applying zero‑copy deserialization for network messages.
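Two of these wins, buffer reuse and zero‑copy decoding, can be illustrated with a short sketch. The wire format (an 8‑byte little‑endian slot number followed by the payload), the `BufferPool` type, and the 4096‑byte initial capacity are all assumptions made for the example. They do not describe the project's actual message layout or allocator.

```rust
/// Hypothetical decoded message whose payload borrows from the input frame.
struct Accept<'a> {
    slot: u64,
    payload: &'a [u8],
}

/// Zero-copy decode: only the 8-byte header is copied, never the payload.
fn decode<'a>(frame: &'a [u8]) -> Option<Accept<'a>> {
    if frame.len() < 8 {
        return None;
    }
    let mut head = [0u8; 8];
    head.copy_from_slice(&frame[..8]);
    Some(Accept { slot: u64::from_le_bytes(head), payload: &frame[8..] })
}

/// Minimal reusable buffer pool standing in for an arena.
struct BufferPool {
    free: Vec<Vec<u8>>,
    allocations: usize, // counts how often a fresh buffer was needed
}

impl BufferPool {
    fn new() -> Self {
        BufferPool { free: Vec::new(), allocations: 0 }
    }

    /// Hand out a recycled buffer if one is available, else allocate.
    fn take(&mut self) -> Vec<u8> {
        match self.free.pop() {
            Some(buf) => buf,
            None => {
                self.allocations += 1;
                Vec::with_capacity(4096)
            }
        }
    }

    /// Return a buffer to the pool; clear() keeps its capacity.
    fn give_back(&mut self, mut buf: Vec<u8>) {
        buf.clear();
        self.free.push(buf);
    }
}

/// Encode into a pooled buffer instead of allocating one per message.
fn encode(pool: &mut BufferPool, slot: u64, payload: &[u8]) -> Vec<u8> {
    let mut buf = pool.take();
    buf.extend_from_slice(&slot.to_le_bytes());
    buf.extend_from_slice(payload);
    buf
}
```

Encoding a second message after returning the first buffer reuses the same allocation, so `allocations` stays at 1. The borrowed `payload` ties the decoded message's lifetime to the frame, which is the usual trade-off of zero‑copy parsing.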
Limitations and open questions
| Area | Current state | Remaining challenges |
|---|---|---|
| Correctness | 1 300+ tests, contract‑driven property checks. | Formal verification is still missing; contracts are only runtime checks and can be disabled in production. |
| Hardware support | Pipelining and NVM integrated. | RDMA integration is still a TODO; the codebase will need careful unsafe‑FFI handling that AI alone may not audit safely. |
| AI autonomy | Human reviews contracts, test output, and performance suggestions. | Fully hands‑off pipelines (e.g., AI writes, runs, and validates contracts without human sign‑off) remain speculative. |
| Scalability of prompts | Prompt sizes stay under 8 K tokens by focusing on single stories. | Larger architectural changes (e.g., redesigning the replication log) may exceed context limits, requiring manual chunking. |
| Cost & rate limits | Two paid Anthropic plans and a ChatGPT Plus subscription were needed to avoid throttling. | Economic feasibility for larger teams is unclear; many organizations would need to budget for multiple concurrent API keys. |
Takeaways for practitioners
- AI excels at repetitive, well‑scoped tasks – writing boilerplate, generating tests from explicit contracts, and suggesting micro‑optimizations.
- Human oversight is still essential – contracts must be reviewed, performance regressions need domain knowledge, and unsafe code paths require manual audit.
- Tooling matters – a CLI‑first workflow (Codex CLI) that can be scripted fits naturally into a build‑test‑refine loop. IDE‑centric approaches tend to introduce more context‑switching.
- Cost discipline drives usage – treating the AI subscription as a sunk cost (the “$100/month forcing function”) can keep momentum high, but teams should monitor token usage to avoid runaway bills.
Wish list for the next generation of AI‑assisted coding
- End‑to‑end story execution – a model that can ingest a user story, generate spec, write code, produce contracts, and run the full test suite without manual prompting.
- Automated contract lifecycle – AI that maintains contract‑test sync, updates failing contracts after a refactor, and flags when a contract becomes obsolete.
- Self‑optimizing performance loops – a system that can explore a search space of compiler flags, data‑structure variants, and micro‑benchmarks, reporting the Pareto‑optimal trade‑offs.
- Integrated cost awareness – prompts that include token‑budget constraints and suggest cheaper alternative models when appropriate.
References & resources
- Azure Replicated State Library (RSL) – internal Microsoft documentation (referenced in the post).
- Jay Lorch’s design markdown for multi‑Paxos: https://github.com/microsoft/replicated-state-library-design
- PoWER‑Never‑Corrupts paper (OSDI 2025): https://www.usenix.org/conference/osdi25/presentation/power-never-corrupts
- Spec‑kit CLI: https://github.com/speckit/spec-kit
- Codex CLI repository: https://github.com/openai/codex-cli
- Claude Code documentation: https://docs.anthropic.com/claude-code

This article reflects a practitioner’s perspective on what AI‑driven development can achieve today, where it still falls short, and what concrete improvements would make the workflow truly autonomous.
