The new GitHub repo shenli/distributed-system-testing provides two markdown‑based AI skills that let a coding assistant design a claim‑driven test plan and then execute it against a distributed system. The approach codifies decades of fault‑injection research into a reproducible workflow, but it still relies on manual claim extraction, limited oracle support, and a fairly heavyweight execution environment.

AI‑driven claim‑focused testing for distributed systems – an early look at shenli/distributed-system-testing

What the repository claims

The project ships two SKILL.md files that can be consumed by any AI coding agent capable of reading markdown and invoking a shell (Claude Code, Codex, Copilot CLI, Cursor, Gemini, …). One skill designs a structured test plan from the product’s public claims; the other executes that plan, captures evidence, and produces a findings report with a nine‑state verdict taxonomy. The output is a set of markdown artifacts that a reviewer can read, without re‑running the tests, to decide whether the system is ready to ship.

Key advertised properties:

Claim‑driven rather than test‑driven – each scenario tries to falsify a specific promise made by the system.
Automatic discovery of existing tests, runbooks, and fault‑injection scaffolding before inventing new harnesses.
Mandatory inclusion of an abstract model (log, lock, ledger, …), an operation‑history schema, a named checker (linearizability, serializability, etc.), and a nemesis description for consistency‑critical scenarios.
Verdicts are classified into nine explicit states, and every failure is tagged with a blame category (SUT, harness, checker, environment).
Installation is a one‑liner that clones the repo into ~/.local/share/distributed-testing-skills/ and creates the necessary symlinks.

What is actually new

1. Structured markdown workflow for AI agents

Most existing AI‑assisted testing tools stop at generating a test stub. Here the design skill produces a full plan that includes:

Architectural summary and scope
A matrix linking each claim to hypothesised failure modes
A technique selection catalog derived from Jepsen, Elle, and other academic work
A coverage adequacy argument and a confidence delta
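To make the claim-to-failure-mode matrix concrete, here is a minimal Python sketch of how such a matrix could be represented and rendered. The class names, field names, and sample claims are hypothetical illustrations; the skill's actual plan format is defined by its SKILL.md and may look quite different.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    """A hypothesised way in which a claim could be falsified."""
    failure_mode: str
    technique: str  # hypothetical: the technique chosen to provoke the failure


@dataclass
class Claim:
    """One promise made by the system, plus the hypotheses that try to break it."""
    claim_id: str
    statement: str
    hypotheses: list[Hypothesis] = field(default_factory=list)


def uncovered_claims(claims):
    """Return the ids of claims that no hypothesis targets yet."""
    return [c.claim_id for c in claims if not c.hypotheses]


def to_markdown(claims):
    """Render the matrix as a markdown table, one row per (claim, hypothesis) pair."""
    lines = ["| Claim | Failure mode | Technique |", "| --- | --- | --- |"]
    for c in claims:
        for h in c.hypotheses:
            lines.append(f"| {c.claim_id} | {h.failure_mode} | {h.technique} |")
    return "\n".join(lines)


# Illustrative data only; these claims are invented for the example.
claims = [
    Claim("C1", "Acknowledged writes survive a leader crash",
          [Hypothesis("ack sent before durable replication", "kill leader after ack")]),
    Claim("C2", "Reads are linearizable"),
]
```

Calling uncovered_claims(claims) on the sample data flags C2, the kind of gap a coverage adequacy argument would need to address or justify.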

The execute skill then follows a disciplined session layout (test‑sessions/<UTC>/…) that records logs, metrics, and per‑scenario verdicts. The plan and findings are both human‑readable markdown, which lowers the barrier for code reviewers who are not familiar with the underlying testing framework.
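As a rough illustration of a timestamped session layout, the following Python sketch creates one directory per run under test-sessions/. Everything below the timestamp directory (the per-scenario folders, logs/, verdict.md) and the timestamp format itself are assumptions made for this example, since the article does not spell out the real layout.

```python
from datetime import datetime, timezone
from pathlib import Path


def create_session(root, scenarios, now=None):
    """Create test-sessions/<UTC>/ with one sub-directory per scenario.

    The sub-layout (logs/ and verdict.md) is a hypothetical choice for this
    sketch, not the layout the execute skill actually produces.
    """
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")  # assumed timestamp format
    session = Path(root) / "test-sessions" / stamp
    for name in scenarios:
        (session / name / "logs").mkdir(parents=True, exist_ok=True)
        (session / name / "verdict.md").write_text(f"# {name}\n\nVerdict: pending\n")
    return session
```

Keying the directory on a UTC timestamp keeps repeated runs from overwriting each other and makes sessions sort chronologically by name.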

2. Integration of abstract models and checkers

For every consistency‑critical scenario the plan forces the author to declare:

Model under test (e.g., a log or a lock)
Operation‑history schema (an 11‑field record used by the checker)
Checker (Porcupine for linearizability, custom serializability scripts, etc.)
Nemesis (fault injection script plus observable landing evidence)

This mirrors the Jepsen methodology but packages it into a reusable markdown template that an AI can fill automatically.
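To show how a model, an operation history, and a checker fit together, here is a small Python sketch of a brute-force linearizability check for a single register. It is an illustration under stated assumptions: the Op record uses five hypothetical fields rather than the skill's 11-field schema, only completed operations are handled, and the search is exponential, so it is no substitute for a real checker such as Porcupine.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    """One completed operation (hypothetical fields, not the skill's schema)."""
    process: int
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned
    invoke: float          # invocation time
    complete: float        # completion time


def is_linearizable(history, initial=None):
    """Search for a total order that respects real-time precedence and in
    which every read returns the most recently written value."""

    def search(remaining, state):
        if not remaining:
            return True
        # An op may be ordered next only if no other remaining op
        # completed before this op was invoked.
        earliest_complete = min(o.complete for o in remaining)
        for op in remaining:
            if op.invoke > earliest_complete:
                continue
            if op.kind == "read" and op.value != state:
                continue
            next_state = op.value if op.kind == "write" else state
            if search(remaining - {op}, next_state):
                return True
        return False

    return search(frozenset(history), initial)


# A read that starts after a write has completed must observe that write.
stale_read = [Op(0, "write", 1, 0.0, 1.0), Op(1, "read", None, 2.0, 3.0)]
# A read that overlaps the write may legally return the old value.
overlapping = [Op(0, "write", 1, 0.0, 3.0), Op(1, "read", None, 1.0, 2.0)]
```

Here is_linearizable(stale_read) is False and is_linearizable(overlapping) is True, which is the distinction a nemesis-driven test is trying to expose: a fault that lets a stale value escape after a write was acknowledged.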

3. Nine‑state verdict taxonomy

Instead of a binary pass/fail, the taxonomy distinguishes:

PASS‑hardening
PASS‑soft
FAIL‑reproducible
FAIL‑non‑reproducible
INCONCLUSIVE‑fault‑not‑proven
PARTIAL‑model‑coverage
... (plus the remaining states, which include environment‑only failures and unknown outcomes)

The extra granularity helps reviewers understand whether a failure is a genuine product defect or an artifact of the test harness.
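A short Python sketch shows how such a taxonomy might be consumed by tooling. It encodes only the six states named above (the remaining states are deliberately left out rather than guessed) together with the four blame categories; the class names, the ScenarioResult record, and the sample scenario names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Verdict(Enum):
    # Only the states named in the article; the rest are omitted on purpose.
    PASS_HARDENING = "PASS-hardening"
    PASS_SOFT = "PASS-soft"
    FAIL_REPRODUCIBLE = "FAIL-reproducible"
    FAIL_NON_REPRODUCIBLE = "FAIL-non-reproducible"
    INCONCLUSIVE_FAULT_NOT_PROVEN = "INCONCLUSIVE-fault-not-proven"
    PARTIAL_MODEL_COVERAGE = "PARTIAL-model-coverage"


class Blame(Enum):
    SUT = "SUT"
    HARNESS = "harness"
    CHECKER = "checker"
    ENVIRONMENT = "environment"


@dataclass
class ScenarioResult:
    """Hypothetical per-scenario record pairing a verdict with a blame tag."""
    scenario: str
    verdict: Verdict
    blame: Optional[Blame] = None

    def __post_init__(self):
        # Every failure must carry a blame category.
        if self.verdict.name.startswith("FAIL") and self.blame is None:
            raise ValueError(f"{self.scenario}: failing verdict needs a blame category")


def product_defects(results):
    """Scenarios whose failure is attributed to the system under test."""
    return [r.scenario for r in results
            if r.verdict.name.startswith("FAIL") and r.blame is Blame.SUT]


results = [
    ScenarioResult("leader-crash", Verdict.FAIL_REPRODUCIBLE, Blame.SUT),
    ScenarioResult("clock-skew", Verdict.FAIL_NON_REPRODUCIBLE, Blame.HARNESS),
    ScenarioResult("partition", Verdict.INCONCLUSIVE_FAULT_NOT_PROVEN),
]
```

With this sample data, product_defects(results) returns only "leader-crash", separating a genuine defect from a harness artifact and an inconclusive run.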

4. Evaluation against a real system (AgentDB)

The repo includes a verification folder with end‑to‑end runs on AgentDB, a Rust‑based distributed runtime. Those runs generated a 670‑line plan covering 16 hypotheses and surfaced six concrete findings, three of which were shipped as PRs. This demonstrates that the workflow can be applied to a non‑toy codebase.

Limitations and open questions

Area	Concern
Claim extraction	The design skill assumes that product claims are discoverable in documentation or code comments. If the claim set is incomplete, the resulting plan will miss important failure modes.
Oracle availability	Checkers like Porcupine require a clean operation history. Systems that do not expose sufficient audit logs will need additional instrumentation, which the skill does not automatically add.
Execution overhead	Running full‑scale nemesis scripts, collecting per‑scenario logs, and performing linearizability checks can be expensive. The current implementation is best suited for pre‑release validation rather than continuous integration.
Agent dependence	Although the skills claim to work with any markdown‑aware agent, the examples and test harnesses are tuned for Claude Code. Porting to other agents may require manual adjustments to the skill paths and environment variables.
Coverage argument	The adequacy argument is a narrative written by the AI. There is no formal proof that the selected hypotheses exhaust the claim space; reviewers must still trust the AI’s reasoning.
Blame classification	The verdict and blame labels describe a failure, but the reduction work behind them (bisecting fault windows, fixing seeds) is still manual in many cases. Automated root‑cause isolation remains an open research problem.
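The bisection step mentioned under blame classification can be partly mechanised. The Python sketch below binary-searches for the shortest prefix of a fault schedule that still reproduces a failure. It rests on two assumptions that real systems often violate: the run is deterministic (for example, the seed is fixed), and adding more faults never makes the failure disappear. The function name and the reproduces callback are hypothetical.

```python
def shortest_failing_prefix(faults, reproduces):
    """Return the shortest prefix of `faults` for which `reproduces` is true.

    Assumes `reproduces` is deterministic and monotonic: if a prefix
    triggers the failure, every longer prefix triggers it too.
    """
    if not reproduces(faults):
        raise ValueError("the full fault schedule does not reproduce the failure")
    lo, hi = 0, len(faults)  # invariant: faults[:hi] reproduces the failure
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(faults[:mid]):
            hi = mid          # a shorter prefix is enough
        else:
            lo = mid + 1      # the failure needs at least mid + 1 faults
    return faults[:hi]


# Illustrative schedule; in practice `reproduces` would re-run the scenario.
schedule = ["partition", "clock-skew", "kill-leader", "pause-follower"]
minimal = shortest_failing_prefix(schedule, lambda prefix: "kill-leader" in prefix)
```

For the sample schedule, minimal is the first three faults. Each probe is a full scenario re-run, so even this logarithmic search is costly, which is one reason reduction tends to stay manual.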

How it fits into the broader testing ecosystem

The repository bridges a gap between two worlds:

Research‑grade fault‑injection and consistency‑checking tools (Jepsen, Elle) that are powerful but heavyweight to set up and run.
AI‑generated test scaffolding that often stops at a single unit test.

By embedding the Jepsen‑style discipline of pairing an abstract model with a checker into a markdown workflow that AI agents can populate, the project makes systematic distributed‑system testing more accessible to teams that already use AI coding assistants.

Getting started

Install the skills with the one‑liner from the repo’s INSTALL.md.
Ask your agent to design a test plan for this system – the skill will output testing-plans/<slug>.md.
Review the plan, adjust the claim list if needed, then ask the agent to execute the plan – it will create a test‑sessions/<UTC>/ directory with logs and a findings report.

For detailed prompts and usage tips see the repository’s USAGE.md.

Final thoughts

shenli/distributed-system-testing is an ambitious attempt to codify best‑practice distributed‑system testing into a format that AI agents can consume. The structured markdown artifacts, the explicit coupling of abstract models to checkers, and the nuanced verdict taxonomy are genuine contributions. At the same time, the approach still depends on high‑quality claim documentation, sufficient observability in the system under test, and a willingness to tolerate the resource cost of full‑scale fault injection. Teams that already run Jepsen‑style experiments will find the workflow familiar, while newcomers will need to invest in instrumentation before the skills can deliver reliable results.

#distributed systems #AI Testing #fault-injection #Jepsen #Test Automation
