Agentic Engineering Takes Shape: Lessons from the Octobatch Experiment
#AI

Backend Reporter

A veteran developer recounts how he built a 21k‑line Monte‑Carlo batch orchestrator entirely with AI code generators, turning the effort into a concrete practice called AI‑Driven Development (AIDD). The article outlines the problem of unstructured AI assistance, the disciplined solution of assigning roles, managing state, and validating outputs, and the trade‑offs between full automation and human oversight.

The problem – AI code helpers are powerful but chaotic

Developers have been handed a flood of LLM‑powered assistants such as Claude Code, GitHub Copilot, Cursor, and Gemini. The hype has split into two camps: one predicts that these tools will make software engineers obsolete, the other treats them as harmless autocomplete. Both extremes ignore the reality most engineers face: a toolbox that can write code but offers no systematic way to keep that code coherent, testable, or maintainable at scale.

Typical symptoms:

  • Thousands of lines of AI‑generated code drift apart from the intended architecture.
  • Review processes become bottlenecks because humans must manually scan massive diffs.
  • Bugs surface only after the system has already been deployed, making rollback expensive.
  • Junior and senior engineers alike struggle to decide when to trust the model and when to intervene.

The gap between “what we should do” (review, test, document) and “how we actually do it” is the core obstacle to productive agentic engineering.


A disciplined solution – AI‑Driven Development (AIDD)

The author’s response was to treat the entire development lifecycle as an orchestrated workflow, mirroring how batch processing frameworks coordinate independent jobs. The resulting methodology, AI‑Driven Development (AIDD), consists of three layers:

  1. Habits (the Sens‑AI Framework) – five reflexive practices that shape every prompt:
    • Provide context – ship the relevant files, schemas, and design docs with each request.
    • Research before prompting – verify terminology, edge‑cases, and API contracts.
    • Frame precisely – use concrete, testable specifications rather than vague goals.
    • Iterate deliberately – limit each generation to a single, reviewable change.
    • Apply critical thinking – treat every output as a hypothesis, not a fact.
  2. Practices – concrete techniques such as multi‑LLM coordination, prompt templating, and schema‑based output validation. For example, the author used Claude for high‑level architecture, Gemini for cross‑checking, and a separate LLM as a test‑generator.
  3. Values – guiding principles (e.g., prefer deletion over addition, keep the system crash‑recoverable) that steer decisions when the practices give no clear answer.
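
Schema‑based output validation, one of the practices named above, can be pictured with the following minimal sketch. It is an illustration under stated assumptions, not the author's code: the field names (`unit_id`, `outcome`, `probability`) and the `validate_output` helper are hypothetical, and it relies only on the Python standard library rather than any particular schema package.

```python
from __future__ import annotations

import json

# Hypothetical schema: field name -> expected Python type.
# These fields are illustrative and do not come from Octobatch.
SCHEMA = {"unit_id": str, "outcome": str, "probability": float}


def validate_output(raw: str, schema: dict = SCHEMA) -> tuple[dict | None, list[str]]:
    """Parse a model response and return (record, errors).

    The record is None whenever any check fails, so callers cannot
    accidentally use a half-valid response.
    """
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"not valid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return None, ["top-level value must be an object"]

    errors = []
    for field, expected in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], expected):
            errors.append(f"{field}: expected {expected.__name__}")
    # Range check specific to this illustrative schema.
    prob = record.get("probability")
    if isinstance(prob, float) and not 0.0 <= prob <= 1.0:
        errors.append("probability: must be between 0 and 1")
    return (record if not errors else None), errors
```

The error list is designed to be fed back into the next prompt, which keeps each retry a single, reviewable change. The type check is deliberately strict (a JSON `1` is rejected where a float is expected); a real pipeline might choose to relax that.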

The workflow mirrors a classic orchestrator:

  • Role assignment – each model gets a narrow responsibility (architecture, implementation, validation).
  • Hand‑off management – outputs are stored in a manifest file that acts as the single source of truth.
  • State persistence – a tick‑model (wake, check state, work, persist, exit) ensures that a crash loses at most the work of the tick in progress.
  • Cost accounting – token usage is logged per batch, enabling budget forecasts before a job is submitted.
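
The tick model and the manifest hand‑off can be sketched roughly as follows. This is a simplified illustration, not Octobatch's implementation: the manifest layout (`{"units": [{"prompt": ..., "status": "pending"}]}`), the function names, and the `work` callable standing in for a model call are all assumptions.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def load_manifest(path: Path) -> dict:
    """Wake: read the manifest, the single source of truth."""
    return json.loads(path.read_text())


def save_manifest(path: Path, manifest: dict) -> None:
    """Persist: write to a temp file, then rename it over the original.

    os.replace is atomic on the same filesystem, so a crash leaves either
    the old manifest or the new one on disk, never a half-written file.
    """
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(manifest, handle, indent=2)
    os.replace(tmp, path)


def tick(path: Path, work: Callable[[str], str]) -> bool:
    """Run one tick; return True if a unit was processed, False if none were pending."""
    manifest = load_manifest(path)  # wake
    # Check state: find units that have not been processed yet.
    pending = [u for u in manifest["units"] if u["status"] == "pending"]
    if not pending:
        return False
    unit = pending[0]
    try:
        unit["result"] = work(unit["prompt"])  # work
        unit["status"] = "done"
    except Exception as exc:
        # Record the failure in the manifest instead of crashing the tick.
        unit["status"] = "failed"
        unit["error"] = str(exc)
    save_manifest(path, manifest)  # persist
    return True  # exit; the next tick starts again from disk
```

A scheduler or a plain loop such as `while tick(Path("manifest.json"), call_model): pass` drives the process, where `call_model` is a hypothetical wrapper around whichever provider is in use. Because every tick starts from the file on disk, restarting after a crash needs no special recovery path.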

Trade‑offs and why they matter

Aspect | Full automation (vibe‑coding) | AIDD (orchestrated)
Human effort | Minimal prompt writing, but massive post‑mortem debugging. | Higher upfront discipline; reduces downstream toil.
Code quality | Plausible but brittle; hidden statistical bias can slip through. | Explicit validation stages catch regressions early.
Scalability | Rate‑limit throttling; hard to run more than 100 concurrent calls. | Batch APIs cut cost by ~50% and allow parallel processing of tens of thousands of prompts.
Failure handling | Manual retries; often lose state on crash. | Built‑in retry, partial‑failure extraction, and crash‑recovery logic.
Vendor lock‑in | Tied to a single assistant’s UI. | Context files and prompt templates are portable across Claude, Gemini, Cursor, etc.
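
To make the cost comparison concrete, the arithmetic behind a pre‑submission budget forecast might look like the sketch below. The prices are placeholders rather than any provider's real rates, the 0.5 discount is simply the approximate figure quoted in the table, and `Pricing` and `forecast_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Placeholder prices in USD per million tokens (not real rates)."""
    input_per_mtok: float
    output_per_mtok: float
    batch_discount: float = 0.5  # the article's approximate ~50% figure


def forecast_cost(n_prompts: int, in_tokens: int, out_tokens: int,
                  pricing: Pricing, batch: bool = True) -> float:
    """Estimate total USD cost for n_prompts of roughly equal size."""
    # Cost of one prompt, expressed in USD per million tokens * tokens.
    per_prompt = (in_tokens * pricing.input_per_mtok
                  + out_tokens * pricing.output_per_mtok)
    total = n_prompts * per_prompt / 1_000_000
    return total * (1 - pricing.batch_discount) if batch else total
```

With the placeholder prices `Pricing(2.0, 10.0)`, a job of 10,000 prompts at 1,500 input and 500 output tokens each comes to about 80 USD at the standard rate and about 40 USD through a batch endpoint. Real forecasts would need measured token counts and the provider's current price sheet.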

The author’s experiment showed that the disciplined approach held total active development time to roughly 75 hours for a Python system of about 21,000 lines and 1,000 tests, a pace that, by the author’s account, would not have been possible without systematic orchestration.


Concrete takeaways for teams

  1. Treat batch APIs as infrastructure – think of LLM calls like MapReduce jobs. Submit a file of prompts, poll for completion, and let the provider handle parallelism.
  2. Persist every state transition – a manifest that records which prompts have been sent, which responses succeeded, and which need retry protects you from crashes.
  3. Validate with a second model – using a different LLM to score or rewrite the first model’s output can catch hallucinations that the primary model missed.
  4. Budget early – token‑count estimation per batch lets you forecast cost before you hit the provider’s quota.
  5. Never assume the AI will delete – if a change feels like a simplification, explicitly ask the model to remove code rather than add more.
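
The second‑model check from takeaway 3 can be sketched as a small accept‑or‑retry loop. This is a hedged illustration only: the models are passed in as plain callables, so no real provider API is implied, and the names `generate_with_review`, `Generator`, and `Judge`, along with the 0.8 threshold, are assumptions.

```python
from typing import Callable, Optional

Generator = Callable[[str], str]       # prompt -> candidate output
Judge = Callable[[str, str], float]    # (prompt, candidate) -> score in [0, 1]


def generate_with_review(prompt: str, generate: Generator, judge: Judge,
                         threshold: float = 0.8,
                         max_attempts: int = 3) -> Optional[str]:
    """Return the first candidate the judge accepts, or None.

    Returning None (rather than the best rejected candidate) forces the
    caller to escalate to a human instead of shipping unreviewed output.
    """
    for _ in range(max_attempts):
        candidate = generate(prompt)
        if judge(prompt, candidate) >= threshold:
            return candidate
    return None
```

In practice `generate` and `judge` would wrap two different models, so that one model's blind spots are less likely to be shared by the reviewer. Capping the attempts also keeps the token spend of a stubborn prompt predictable.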

Looking ahead

The next article in the series will dissect the Octobatch architecture itself, showing a real‑world pipeline from prompt to Monte‑Carlo simulation results. Subsequent posts will dive deeper into multi‑LLM coordination, automated test generation, and how to extract actionable insights from the full transcript of AI‑human interactions.

For teams ready to move beyond ad‑hoc prompt‑and‑copy, the AIDD framework offers a repeatable, auditable path to scale AI‑assisted development without surrendering architectural control.
