BobbyLLM/llama-conductor: A Router for Trustworthy LLM Workflows

Tech Essays Reporter

A new open-source harness, llama-conductor, addresses the fundamental unreliability of large language models by enforcing deterministic workflows, providing verifiable provenance, and managing memory through a 'glass-box' architecture designed for users who demand consistency over conversational flair.

The promise of large language models as reliable reasoning engines often clashes with their inherent unpredictability. They hallucinate, forget context, and produce answers that feel plausible but lack grounding. For users who require precision—developers, researchers, or anyone with a low tolerance for "vibes-based" responses—this is a critical failure. The open-source project llama-conductor emerges from this frustration, proposing not a better model, but a better system. It is a router and harness that forces LLMs to operate as predictable components within a structured, verifiable pipeline.

At its core, llama-conductor is a rejection of the black-box chat paradigm. Instead of a free-form conversation, it orchestrates specific workflows: a default "Serious" mode for general queries, a "Mentats" mode for deep, grounded reasoning, and playful "Fun" modes that add personality without altering facts. The system's philosophy is encapsulated in its creator's stated goal: "In God we trust. All others must bring data." This isn't just a motto; it's an architectural principle. The router intercepts every request and response, applying filters and memory systems that operate independently of the LLM itself, using what the project calls "1990s tech"—JSON files and deterministic logic—to manage state.
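The exact routing logic lives in the project's source, but the idea can be shown with a minimal sketch: a deterministic, prefix-based dispatcher that never asks the model where a message should go. Only the prefixes themselves come from the project's documented commands; the function name and return shape below are illustrative assumptions.

```python
# A minimal, illustrative sketch of prefix-based mode dispatch; the real
# router's internals may differ. The prefixes (##mentats, !!, ??, >>) come
# from the article; the function and return shape are assumptions.

from typing import Tuple

PREFIXES = [
    ("##mentats", "mentats"),   # grounded, Vault-only reasoning
    ("!!", "store_fact"),       # Total Recall: store a verbatim fact
    ("??", "recall_fact"),      # Total Recall: recall by keyword
    (">>", "kb_command"),       # filesystem KB commands: attach, summ, move
]

def route(message: str) -> Tuple[str, str]:
    """Return (mode, payload) for a raw chat message, defaulting to Serious mode."""
    for prefix, mode in PREFIXES:
        if message.startswith(prefix):
            return mode, message[len(prefix):].strip()
    return "serious", message

print(route("!! my server is at 203.0.113.42"))  # ('store_fact', 'my server is at 203.0.113.42')
print(route("What is the capital of France?"))   # ('serious', 'What is the capital of France?')
```

Because the dispatch is plain string matching, the same input always reaches the same mode, regardless of what the model might "prefer."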

The Problem: Unreliability and Context Bloat

Llama-conductor directly targets three issues that plague standard LLM interactions. First is the "vibes-based answer" problem: when a model confidently states a fact, there is no built-in mechanism to verify its source. The project illustrates this with a common developer scenario: asking for a command-line flag. A standard model might invent a flag, and the user wastes time trying it. With llama-conductor's ##mentats mode, the query is routed exclusively to a curated knowledge base (the "Vault"). If the Vault contains no relevant information, the system refuses to answer, stating: "The Vault contains no relevant knowledge for this query." Only after the user has formally added documentation and generated a verifiable summary does the system provide an answer, complete with a SHA-256 hash of the source document for manual verification.
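In practice, that refusal is an ordinary guard clause rather than model behavior. A minimal sketch, assuming a retrieval step that returns scored chunks and a hypothetical relevance threshold (the hit structure and threshold are assumptions; the refusal string is quoted from the project's documented behavior):

```python
# Illustrative only: a "no Vault hits, no answer" guard. The hit structure,
# score field, and threshold are assumptions.

from typing import Dict, List

REFUSAL = "The Vault contains no relevant knowledge for this query."

def answer_from_vault(query: str, hits: List[Dict], min_score: float = 0.35) -> str:
    """Refuse outright unless retrieval produced at least one relevant chunk."""
    grounded = [h for h in hits if h.get("score", 0.0) >= min_score]
    if not grounded:
        return REFUSAL
    # Only retrieved chunks (never free-form chat history) move on to the
    # reasoning passes, and their source hashes travel with them.
    sources = ", ".join(h["sha256"][:12] for h in grounded)
    return f"[grounded answer built from {len(grounded)} chunk(s); sources: {sources}]"

print(answer_from_vault("undocumented --flag?", hits=[]))  # prints the refusal
```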

Second is the "goldfish memory" of context windows. Models forget earlier details in long conversations. Llama-conductor's "Total Recall" (TR) system stores facts as literal, verbatim entries in a JSON file, indexed by an ID. When a user stores a fact with !! my server is at 203.0.113.42, it's saved. Later, recalling it with ?? server retrieves the exact string, with metadata on its expiration and usage. This memory exists outside the model's context, preventing dilution and ensuring consistency.

Third is context bloat, where long chat histories consume precious VRAM and slow down or crash systems with limited resources. The "Vodka" component's "Cut The Crap" (CTC) filter acts as a context sanitizer. It keeps only the last N message pairs, optionally the first system prompt, and enforces a hard character cap. This ensures the prompt size sent to the model remains stable and small, allowing even low-VRAM systems to maintain consistent performance over extended sessions.
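A filter with those three rules (keep the first system prompt, keep the last N message pairs, enforce a hard character cap) fits in a few lines. The defaults below are placeholders, not the project's actual settings.

```python
# Illustrative sketch of a "Cut The Crap" style context sanitizer. Parameter
# defaults are assumptions; the behavior follows the three rules described above.

from typing import Dict, List

def ctc_filter(messages: List[Dict[str, str]],
               keep_pairs: int = 4,
               keep_system: bool = True,
               char_cap: int = 6000) -> List[Dict[str, str]]:
    """Return a trimmed message list with a bounded, predictable size."""
    system = [m for m in messages if m["role"] == "system"][:1] if keep_system else []
    turns = [m for m in messages if m["role"] != "system"]
    trimmed = system + turns[-(keep_pairs * 2):]      # last N user/assistant pairs

    # Drop the oldest non-system messages until the total fits the hard cap.
    while sum(len(m["content"]) for m in trimmed) > char_cap and len(trimmed) > len(system) + 1:
        trimmed.pop(len(system))
    return trimmed
```

Because the trimming is purely mechanical, the prompt the model sees has the same shape on message 500 as it did on message 5.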

Architecture: A Stack of Deterministic Components

The llama-conductor stack is modular, designed to be assembled from existing, robust tools. The router itself is a Python FastAPI application. It communicates with model backends via an OpenAI-compatible API, typically through llama-swap, which dynamically loads models using llama.cpp. This allows users to swap models without restarting the entire stack.
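Conceptually, the router sits between the client and the backend as a thin OpenAI-compatible proxy, with its deterministic machinery applied before each request is forwarded. A stripped-down sketch, assuming a llama-swap or llama.cpp server listening locally (the URL and port are placeholders, and the real project layers mode routing, filters, and memory around this hop):

```python
# Minimal sketch of a FastAPI router forwarding OpenAI-compatible chat requests
# to a local backend such as llama-swap. Backend URL and port are assumptions.

import httpx
from fastapi import FastAPI, Request

app = FastAPI()
BACKEND = "http://127.0.0.1:8080/v1/chat/completions"   # assumed backend address

@app.post("/v1/chat/completions")
async def chat(request: Request):
    body = await request.json()
    # This is where a conductor-style router would apply its deterministic
    # pre-processing (mode dispatch, CTC trimming, memory injection) to `body`.
    async with httpx.AsyncClient(timeout=120.0) as client:
        resp = await client.post(BACKEND, json=body)
    return resp.json()
```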

For retrieval-augmented generation (RAG), it uses Qdrant as its vector database. The workflow for adding knowledge is deliberate and traceable:

  1. Place documents in a designated folder (e.g., C:/docs/myKB/).
  2. Attach the folder with >>attach myKB.
  3. Generate a summarized, chunked version with >>summ new. This creates a SUMM_ file containing the summary and a SHA-256 hash of the original document (a sketch of this hashing step follows the list).
  4. Promote the summary to the Vault with >>move to vault, which stores the chunks and their embeddings in Qdrant.
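The hashing half of step 3 is straightforward to picture. The sketch below is an assumption about the mechanics, not the project's file format: it pairs a placeholder summary with the SHA-256 of the source document so the digest can later be re-checked by hand.

```python
# Rough sketch of step 3's hashing: write a SUMM_ file that pairs a summary
# with the SHA-256 of the source document. File layout, naming convention,
# and the `summarize` stub are assumptions, not the project's format.

import hashlib
from pathlib import Path

def summarize(text: str) -> str:
    # Placeholder: the real step would call the active model to chunk and summarize.
    return text[:500]

def make_summ_file(doc_path: str) -> Path:
    doc = Path(doc_path)
    digest = hashlib.sha256(doc.read_bytes()).hexdigest()
    summ_path = doc.with_name(f"SUMM_{doc.stem}.md")
    summ_path.write_text(
        f"source: {doc.name}\nsha256: {digest}\n\n{summarize(doc.read_text(errors='ignore'))}\n"
    )
    return summ_path

# A reader can later re-hash the original document and compare it against the
# stored digest to confirm the source material has not changed.
```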

When ##mentats is invoked, it performs a three-pass reasoning workflow: a "Thinker" drafts an answer using only retrieved Vault facts, a "Critic" checks for overstatement, and the "Thinker" refines the output. The final response is structured, listing the answer, facts used, constraints followed, and notes. The provenance trail is complete: the answer points to a Vault chunk, which points to a SUMM file, which points to the original document with its SHA-256 hash. A user can manually verify the hash to confirm the source material hasn't been tampered with.
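A minimal sketch of that Thinker, Critic, Thinker loop, assuming a generic call_model function standing in for any OpenAI-compatible chat call; the prompt wording, chunk structure, and output fields are illustrative, not the project's prompts:

```python
# Illustrative three-pass loop in the Thinker -> Critic -> Thinker shape
# described above. `call_model`, the prompts, and `chunks` are assumptions.

from typing import Callable, Dict, List

def mentats(query: str, chunks: List[Dict], call_model: Callable[[str], str]) -> Dict:
    facts = "\n".join(f"- {c['text']} (sha256: {c['sha256'][:12]})" for c in chunks)

    draft = call_model(                                # pass 1: Thinker drafts from Vault facts only
        f"Answer using ONLY these facts. If they are insufficient, say so.\n{facts}\n\nQ: {query}"
    )
    critique = call_model(                             # pass 2: Critic checks for overstatement
        f"Check this draft for claims not supported by the facts.\nFacts:\n{facts}\n\nDraft:\n{draft}"
    )
    final = call_model(                                # pass 3: Thinker refines without adding claims
        f"Revise the draft to address the critique without adding new claims.\n"
        f"Draft:\n{draft}\n\nCritique:\n{critique}"
    )
    return {
        "answer": final,
        "facts_used": [c["sha256"] for c in chunks],   # provenance back to SUMM files
        "notes": critique,
    }
```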

Practical Implications for Resource-Constrained Environments

A significant implication of this architecture is its efficiency on low-end hardware, or "potato PCs." By decoupling memory and context management from the LLM's inference process, the system reduces the computational burden. Vodka's CTC keeps the active context window tiny and stable. Total Recall stores facts as disk lookups, not in the model's KV cache. Filesystem knowledge bases are pre-processed, so RAG doesn't require embedding every document on every query. Mentats, while the most computationally intensive part, runs only when explicitly invoked and can use smaller, local models for its Thinker and Critic roles.

The result is that a system with limited VRAM can run a more capable model because the context window isn't bloated by hundreds of messages. The project's model recommendations reflect this philosophy: it suggests efficient, smaller models like Qwen-3-4B or Phi-4-mini, which "punch way above their weight," rather than defaulting to the largest available. The creator's personal setup runs on a 4GB VRAM laptop, demonstrating that reliable, grounded LLM workflows are possible without enterprise-grade hardware.

Modes and Workflow Control

The router's modes provide clear boundaries for interaction. The default "Serious" mode uses attached filesystem KBs and Vodka's memory, suitable for everyday queries. "Mentats" is the high-stakes, proof-driven mode, isolated from chat history and filesystem KBs, relying solely on the curated Vault. The "Fun" modes add a layer of personality by appending relevant quotes from a curated file (quotes.md) to the response, while the "Fun Rewrite" mode rewrites the answer in the voice of a selected quote, adding sarcasm or incredulity without altering factual content.
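The quote-appending behavior of the "Fun" modes is the simplest to illustrate: the answer is computed first, and flair is bolted on afterward. The one-quote-per-line format and random selection below are assumptions.

```python
# A toy sketch of the "Fun" layer: append a quote from a curated quotes.md to a
# finished answer without touching its content. File format is an assumption.

import random
from pathlib import Path

def add_flair(answer: str, quotes_path: str = "quotes.md") -> str:
    quotes = [q.strip() for q in Path(quotes_path).read_text().splitlines() if q.strip()]
    if not quotes:
        return answer                      # facts first; flair only if available
    return f"{answer}\n\n> {random.choice(quotes)}"
```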

This structured approach extends to troubleshooting. Common issues like "No Vault hits stored yet" or "PROVENANCE MISSING" have clear resolutions tied to the workflow. The system is designed to fail transparently, providing debug logs (mentats_debug.log) and explicit error messages that guide the user toward fixing the root cause, whether it's a missing document, a SHA mismatch, or a misconfigured model name in the backend.

Conclusion: A Shift from Conversation to Computation

llama-conductor represents a philosophical shift in how we interact with LLMs. It moves away from treating them as conversational partners and towards treating them as computational engines within a controlled, observable system. By prioritizing provenance, deterministic memory, and context management, it offers a path to using LLMs for tasks where reliability is paramount. The AGPL-3.0 license ensures that this harness remains open and modifiable, inviting users to build upon a foundation designed for trust, not just vibes. For anyone who has ever been burned by a confident hallucination, this project offers a methodical, transparent alternative.

For setup instructions and detailed configuration, consult the official README and the technical setup guide within the repository.
