Benchmarking AI Coding Agents on Kubernetes Reveals Retrieval Trade‑offs and Scope‑Discovery Gaps
#DevOps



A recent CNCF benchmark measured three AI agent configurations—RAG‑only, hybrid RAG + local filesystem, and pure local clone—against real Kubernetes bugs. While RAG‑only was fastest and cheapest, all agents struggled with system‑wide impact analysis, highlighting the importance of high‑quality issue descriptions and the need for better scope‑discovery mechanisms.

Benchmarking AI Coding Agents on Kubernetes


Published May 15, 2026 • Claudio Masolo, Senior DevOps Engineer, Nearform


Technical announcement

Brandon Foley’s recent CNCF blog post presented a controlled benchmark that integrates AI coding agents directly into the Kubernetes development workflow. The experiment measured three distinct agent configurations against nine real‑world bug reports drawn from the kubelet, scheduler, networking, storage, and apps subsystems. Each bug had already been fixed by a human contributor, providing a ground‑truth reference for correctness.

All agents received only the issue description—no pull‑request title, no diff, no additional context. The model behind every run was Claude Opus 4.6, constrained to a five‑minute execution window and a fixed output format (markdown patch). The sole variable was the code visibility strategy:

| Configuration | Retrieval method | Code source |
| --- | --- | --- |
| RAG‑only | KAITO RAG Engine (Qdrant) – BM25 + embedding search | Retrieved snippets only |
| Hybrid | RAG first, then local filesystem walk | Retrieved snippets plus full repo checkout |
| Local‑only | Direct clone of the repository, no index | Entire codebase |
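
For concreteness, the three strategies map naturally onto a small context-assembly step in whatever harness drives the agent. The sketch below is illustrative only: `build_context`, `query_rag_index`, and `clone_repo` are hypothetical names, not part of the published benchmark.

```python
from dataclasses import dataclass, field
from enum import Enum

class Visibility(Enum):
    RAG_ONLY = "rag-only"      # retrieved snippets only
    HYBRID = "hybrid"          # retrieved snippets plus a full repo checkout
    LOCAL_ONLY = "local-only"  # full repo checkout, no index

def query_rag_index(issue: str) -> list[str]:
    """Placeholder for a KAITO/Qdrant lookup (BM25 + embedding search)."""
    return []

def clone_repo(repo: str) -> str:
    """Placeholder for a shallow clone of the target repository."""
    return f"/tmp/{repo.split('/')[-1]}"

@dataclass
class AgentContext:
    issue_description: str
    snippets: list[str] = field(default_factory=list)  # filled by retrieval, if any
    repo_path: str | None = None                       # set when a local clone exists

def build_context(issue: str, mode: Visibility) -> AgentContext:
    """Assemble the code visibility handed to the agent for a single run."""
    ctx = AgentContext(issue)
    if mode in (Visibility.RAG_ONLY, Visibility.HYBRID):
        ctx.snippets = query_rag_index(issue)
    if mode in (Visibility.HYBRID, Visibility.LOCAL_ONLY):
        ctx.repo_path = clone_repo("kubernetes/kubernetes")
    return ctx
```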

The study captured three primary dimensions: latency, cost (token usage), and correctness.


Specifications & benchmark results

1. Latency and cost

| Config | Avg. latency | Avg. token count | Approx. API cost* |
| --- | --- | --- | --- |
| RAG‑only | 76 s | 12 k tokens | $0.018 |
| Hybrid | 152 s | 21 k tokens | $0.032 |
| Local‑only | 138 s | 18 k tokens | $0.027 |

*Costs calculated using the public Claude Opus pricing (≈$0.0015 per 1 k input tokens, $0.003 per 1 k output tokens). The hybrid approach is the most expensive because each RAG step triggers a full conversation replay, inflating the token count.
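
The cost figures follow directly from the quoted rates. The sketch below reproduces the arithmetic; the input/output split in the second example is an assumption, since the post reports only total token counts.

```python
INPUT_RATE = 0.0015 / 1000   # USD per input token (rate quoted above)
OUTPUT_RATE = 0.003 / 1000   # USD per output token (rate quoted above)

def run_cost(input_tokens: int, output_tokens: int) -> float:
    """Approximate API cost of a single agent run."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# The ~$0.018 RAG-only figure matches 12k tokens charged at the input rate;
# a realistic input/output split lands slightly higher.
print(round(run_cost(12_000, 0), 3))        # 0.018
print(round(run_cost(11_000, 1_000), 4))    # 0.0195
```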

2. Correctness breakdown

| Config | Fully correct patches | Partially correct (missing scope) | Incorrect |
| --- | --- | --- | --- |
| RAG‑only | 3 / 9 | 5 / 9 | 1 / 9 |
| Hybrid | 4 / 9 | 4 / 9 | 1 / 9 |
| Local‑only | 4 / 9 | 4 / 9 | 1 / 9 |

Key observations

  • The dominant failure mode was incomplete fixes: agents resolved the immediate symptom but omitted ancillary changes (e.g., updating validation logic or adjusting the associated CRDs).
  • Architectural drift appeared in two cases, where agents introduced a new struct field instead of reusing an existing one, increasing surface area without functional benefit.
  • When the issue description explicitly named the target file, function, and expected behavior, all three configurations converged to high correctness (≥ 80 % fully correct).

3. Retrieval impact on reasoning

  • RAG‑only agents benefitted from a focused context window; the retrieved snippets acted as a natural “attention filter,” reducing hallucination risk.
  • Hybrid agents suffered from extra latency due to the mandatory RAG step, yet they did not show measurable correctness gains over the pure local clone.
  • Local‑only agents spent more tokens navigating the full repository, but the broader view did not translate into better system‑wide impact awareness.

Real‑world implications for DevOps teams

1. Prioritize high‑quality issue triage

The benchmark demonstrates that well‑crafted bug reports dramatically flatten performance gaps. Teams should enforce a minimal issue template that includes:

  • Exact file path(s)
  • Function name(s) involved
  • Expected vs. actual behavior
  • Relevant configuration flags or API versions

Embedding this metadata into the issue description enables any retrieval‑based agent to locate the correct code region quickly, cutting both latency and token spend.
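
A lightweight triage check can enforce those fields before any agent run is ever scheduled. The sketch below is a minimal illustration; the field names and regular expressions are assumptions, not part of the benchmark.

```python
import re

# Minimal checks for the fields listed above; the patterns are illustrative only.
REQUIRED_FIELDS = {
    "file path": re.compile(r"\b[\w./-]+\.go\b"),            # e.g. pkg/kubelet/kubelet.go
    "function name": re.compile(r"\b(func\s+\w+|\w+\(\))"),  # e.g. syncPod()
    "expected vs. actual": re.compile(r"expected.+actual", re.I | re.S),
}

def missing_fields(issue_body: str) -> list[str]:
    """Return the template fields that the issue description does not satisfy."""
    return [name for name, pattern in REQUIRED_FIELDS.items()
            if not pattern.search(issue_body)]

issue = ("kubelet crashes on shutdown; expected graceful termination, "
         "actual panic in pkg/kubelet/kubelet.go syncPod()")
print(missing_fields(issue))   # [] -> well-scoped enough to hand to an agent
```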

2. Retrieval strategy selection

  • For time‑critical hot‑fixes (e.g., security patches), the RAG‑only configuration offers the best latency‑cost profile while still delivering acceptable correctness when the issue is well scoped.
  • When full repository context is required—such as refactoring across multiple packages—teams may opt for the local‑only approach, accepting higher cost for broader visibility.
  • The hybrid mode currently provides little added value; its overhead outweighs any marginal correctness benefit. Future work could explore smarter gating (e.g., skip RAG if a local diff already matches the query).
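
One plausible form of that gating, assuming the harness has both a local checkout and access to the RAG index, is to skip retrieval whenever a cheap lexical scan of the clone already pins the issue to a handful of files:

```python
import subprocess

def needs_rag(issue_keywords: list[str], repo_path: str, max_hits: int = 5) -> bool:
    """Skip the RAG step when a plain `git grep` over the local clone already
    narrows the issue down to a small set of candidate files."""
    hits: set[str] = set()
    for kw in issue_keywords:
        out = subprocess.run(
            ["git", "-C", repo_path, "grep", "-l", "--", kw],
            capture_output=True, text=True,
        )
        hits.update(line for line in out.stdout.splitlines() if line)
    # Few, specific hits -> local context is enough; zero or many hits -> fall back to RAG.
    return not (0 < len(hits) <= max_hits)
```

The threshold here is arbitrary; the point is that the expensive retrieval step becomes conditional rather than mandatory.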

3. Scope‑discovery as the next research frontier

The study confirms that AI agents excel at local bug resolution but falter on system‑wide impact analysis. Potential mitigations include:

  • Structured agent skills: predefined playbooks that enumerate dependent components (e.g., “if you modify kubelet config, also check kube-proxy”).
  • Change impact graphs: integrate static analysis tooling (e.g., bazel query or go list / go mod graph) to surface transitive dependencies before the LLM generates a patch (a rough sketch follows this list).
  • Iterative clarification loops: allow agents to ask follow‑up questions when the issue description lacks sufficient scope information, though this requires stateful conversation handling.
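
The change-impact-graph idea can be prototyped with nothing more than the Go toolchain: invert the package import graph so a modified package can be mapped to everything that depends on it. This is a rough sketch under that assumption, not the benchmark's tooling.

```python
import subprocess
from collections import defaultdict

def reverse_import_graph(repo_path: str) -> dict[str, set[str]]:
    """Map each package to the set of in-repo packages that import it,
    using `go list` over the whole module."""
    out = subprocess.run(
        ["go", "list", "-f", '{{.ImportPath}} {{join .Imports " "}}', "./..."],
        cwd=repo_path, capture_output=True, text=True, check=True,
    )
    rdeps: dict[str, set[str]] = defaultdict(set)
    for line in out.stdout.splitlines():
        pkg, *imports = line.split()
        for imp in imports:
            rdeps[imp].add(pkg)
    return rdeps

def impacted_packages(changed_pkg: str, rdeps: dict[str, set[str]]) -> set[str]:
    """Transitively collect packages that directly or indirectly import changed_pkg."""
    seen: set[str] = set()
    stack = [changed_pkg]
    while stack:
        for dependent in rdeps.get(stack.pop(), ()):
            if dependent not in seen:
                seen.add(dependent)
                stack.append(dependent)
    return seen
```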

4. Operational considerations

  • Token budgeting: Deploy a per‑agent token quota to prevent runaway costs during large‑scale rollouts.
  • Observability: Instrument each agent run with latency, token count, and success metrics; feed these into a Prometheus dashboard for continuous performance tracking (see the sketch after this list).
  • Security: When agents access a full repository clone, ensure the workspace runs with least‑privilege permissions and that secret files are excluded from the mount.
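
As an illustration of the observability point, a thin wrapper around the Python prometheus_client library can expose per-run metrics. The metric and label names below are assumptions to adapt to your own rollout.

```python
from prometheus_client import Counter, Histogram, start_http_server

# Hypothetical metric names; adjust labels and buckets to match your rollout.
RUN_LATENCY = Histogram("agent_run_latency_seconds", "End-to-end agent run latency",
                        ["retrieval_mode"])
RUN_TOKENS = Histogram("agent_run_tokens", "Tokens consumed per agent run",
                       ["retrieval_mode"],
                       buckets=(1_000, 5_000, 10_000, 20_000, 50_000))
RUN_RESULTS = Counter("agent_runs_total", "Agent runs by outcome",
                      ["retrieval_mode", "outcome"])

def record_run(mode: str, latency_s: float, tokens: int, outcome: str) -> None:
    """Record one agent run; 'outcome' might be correct, partial, or incorrect."""
    RUN_LATENCY.labels(mode).observe(latency_s)
    RUN_TOKENS.labels(mode).observe(tokens)
    RUN_RESULTS.labels(mode, outcome).inc()

if __name__ == "__main__":
    start_http_server(9090)                        # scrape endpoint for Prometheus
    record_run("rag-only", 76.0, 12_000, "partial")
```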

Deployment checklist for AI‑assisted bug fixing on Kubernetes

  1. Set up a retrieval index (Qdrant, Milvus, or Elasticsearch) populated with the latest Kubernetes source tree.
  2. Configure the LLM endpoint (Claude Opus, GPT‑4o, etc.) with a fixed token budget per request.
  3. Create an issue‑template enforcing file, function, and expected behavior fields.
  4. Wrap the agent in a CI job (GitHub Actions, Tekton), sketched after this checklist, that:
    • Pulls the latest issue description.
    • Executes the selected retrieval mode.
    • Generates a patch and runs the relevant test suite (e.g., go test or kubetest) against the targeted component.
    • Posts the result as a draft PR for human review.
  5. Monitor latency, token usage, and patch acceptance rate; adjust retrieval mode based on observed cost‑benefit.
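
Step 4 of the checklist could be wired up as a small driver invoked by the CI job. Every function below is a placeholder for whichever issue tracker, agent runtime, and test runner a team actually uses; go test stands in for the component-specific suite.

```python
import subprocess
import sys

def fetch_issue(issue_id: str) -> str:
    """Placeholder: pull the issue description from your tracker's API."""
    raise NotImplementedError

def run_agent(issue_body: str, mode: str) -> str:
    """Placeholder: execute the selected retrieval mode and return a unified-diff patch."""
    raise NotImplementedError

def open_draft_pr(patch: str, issue_id: str) -> None:
    """Placeholder: push a branch and open a draft PR for human review."""
    raise NotImplementedError

def main(issue_id: str, mode: str = "rag-only") -> int:
    patch = run_agent(fetch_issue(issue_id), mode)
    # Sanity-check the patch applies cleanly before spending CI time on tests.
    subprocess.run(["git", "apply", "--check", "-"], input=patch, text=True, check=True)
    tests = subprocess.run(["go", "test", "./..."], capture_output=True, text=True)
    open_draft_pr(patch, issue_id)   # humans still review, regardless of test outcome
    return tests.returncode

if __name__ == "__main__":
    sys.exit(main(sys.argv[1]))
```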

Looking ahead

The benchmark underscores a clear message: retrieval architecture matters for speed and cost, but not for deep reasoning. To push AI agents from “bug‑fix assistants” to “system‑wide refactoring partners,” the community must invest in tools that expose code dependencies and enable iterative clarification. Until then, the most reliable lever remains a well‑specified issue description.

For those interested in reproducing the study, the full dataset and scripts are available on the project’s GitHub repository: https://github.com/cncf/ai-agent-k8s-benchmark.


Claudio Masolo is a Senior DevOps Engineer at Nearform. In his spare time he enjoys running, reading, and playing retro video games.

