Powering the Future: Building a GenAI Infrastructure Stack at Intuit

Infrastructure Reporter

Merrin Kurian walks through Intuit’s GenOS platform, the “fixed‑flexible‑free” framework, and the engineering practices behind a platform that is open to 8 000 developers and has launched more than 3 500 AI experiments. The article details the architecture, tooling, failure‑mode analysis, and governance needed to scale generative‑AI agents in an enterprise.

Technical announcement

Intuit has published details of its Generative AI Operating System (GenOS), a unified stack that lets product teams create, test, and ship AI agents at enterprise scale. The platform is built around three design pillars:

  • Fixed – mandatory security, compliance, and identity controls that every developer inherits automatically.
  • Flexible – a curated set of model families, tool‑integration patterns, and runtime options that can be swapped without breaking the platform contract.
  • Free – self‑service APIs and CI/CD pipelines that let engineers experiment without waiting for central approvals.
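One way to picture how the three pillars interact is a config resolver that refuses overrides of mandatory controls, validates the curated choices, and passes everything else through. This is a minimal sketch, not GenOS code: the control names, the tier names used as keys, and the function `resolve_agent_config` are all assumptions made for illustration.

```python
# Hypothetical values: none of these names come from GenOS itself.
FIXED_CONTROLS = {"pii_redaction": True, "audit_logging": True, "auth": "oauth2"}
ALLOWED_MODEL_TIERS = {"reasoning", "workhorse", "lightweight"}


def resolve_agent_config(team_config: dict) -> dict:
    """Merge a team's config with the platform's mandatory controls."""
    # Fixed: teams may not override security/compliance settings.
    overridden = FIXED_CONTROLS.keys() & team_config.keys()
    if overridden:
        raise ValueError(f"cannot override fixed controls: {sorted(overridden)}")

    # Flexible: the model tier must come from the curated set.
    tier = team_config.get("model_tier", "workhorse")
    if tier not in ALLOWED_MODEL_TIERS:
        raise ValueError(f"unknown model tier: {tier!r}")

    # Free: every other key is passed through untouched.
    return {**team_config, "model_tier": tier, **FIXED_CONTROLS}
```

A team that sets only `model_tier` and its own parameters inherits the fixed controls automatically; a team that tries to set `audit_logging` gets an error instead of a silent override.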

Merrin Kurian, Distinguished Engineer, presented the architecture at QCon San Francisco (May 19 2026) and explained how the stack supports more than 450 000 requests per day and consumes 4 trillion+ tokens each month.


Specifications

Core components

| Component | Role | Key specs |
| --- | --- | --- |
| AI Workbench | Integrated IDE for prompt authoring, evaluation, and fine‑tuning. | Supports 70+ LLM versions, vector‑store indexing, automated A/B testing. |
| GenRuntime | Stateless execution engine that hosts agents, handles tool calls, and enforces guardrails. | Horizontal scaling to 10 k concurrent invocations, sub‑millisecond latency for cheap models, up to 2 min for reasoning models. |
| GenUX | Front‑end widget library (chat, form, multimodal upload) that renders agent interactions. | React‑compatible, built‑in telemetry, configurable SLO templates. |
| Registries | Central stores for prompts, agents, tools, and use‑case metadata. | Versioned with Git‑style semantics, supports rollback to any prior revision. |
| Evaluation pipeline | End‑to‑end test harness that runs LLM‑as‑judge scoring, regression suites, and cost analysis. | Supports custom metrics (precision@k, hallucination rate, token‑cost per request). |
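The three custom metrics named for the evaluation pipeline have simple standard definitions. The sketch below shows one plausible way to compute them; the function names and input shapes are assumptions, and the judge verdicts are taken as given booleans rather than produced by a real LLM‑as‑judge call.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for item in ranked_ids[:k] if item in relevant)
    return hits / k


def hallucination_rate(judge_verdicts):
    """Share of responses a judge flagged as hallucinated (True = flagged)."""
    if not judge_verdicts:
        return 0.0
    return sum(1 for flagged in judge_verdicts if flagged) / len(judge_verdicts)


def token_cost_per_request(total_tokens, usd_per_million_tokens, num_requests):
    """Average token spend per request, in US dollars."""
    if num_requests <= 0:
        raise ValueError("num_requests must be positive")
    return total_tokens / 1_000_000 * usd_per_million_tokens / num_requests
```

For example, `precision_at_k(["a", "b", "c", "d"], {"a", "c"}, 2)` scores one hit out of two retrieved items.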

Model portfolio

  • Reasoning tier – 15B‑parameter models optimized for planning and tool orchestration (≈ 150 ms latency, $0.80 per million tokens).
  • Workhorse tier – 6B‑parameter models for bulk text generation (≈ 30 ms latency, $0.12 per million tokens).
  • Lightweight tier – 1B‑parameter models for summarization and classification (≈ 5 ms latency, $0.03 per million tokens).
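To make the cost difference between tiers concrete, the sketch below routes a task to the cheapest tier that lists it and estimates the token spend. The latency and price figures are the ones quoted in the list above; the task labels, the `ModelTier` type, and both function names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str
    typical_latency_ms: float
    usd_per_million_tokens: float
    tasks: frozenset  # hypothetical task labels derived from the tier descriptions


TIERS = [
    ModelTier("lightweight", 5, 0.03, frozenset({"summarization", "classification"})),
    ModelTier("workhorse", 30, 0.12, frozenset({"text_generation"})),
    ModelTier("reasoning", 150, 0.80, frozenset({"planning", "tool_orchestration"})),
]


def pick_tier(task: str) -> ModelTier:
    """Return the cheapest tier that lists the given task."""
    candidates = [tier for tier in TIERS if task in tier.tasks]
    if not candidates:
        raise ValueError(f"no tier handles task {task!r}")
    return min(candidates, key=lambda tier: tier.usd_per_million_tokens)


def estimate_cost_usd(tier: ModelTier, tokens: int) -> float:
    """Token cost in US dollars for a given number of tokens."""
    return tokens / 1_000_000 * tier.usd_per_million_tokens
```

At the quoted prices, the same million tokens costs roughly 27 times more on the reasoning tier than on the lightweight tier, which is why routing by task matters for budget control.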

Fine‑tuning is performed on a per‑business‑unit basis; QuickBooks, for example, runs personalized transaction‑categorization models for each small business, reducing required training samples from millions to a few thousand.

Deployment pipeline (Agent Starter Kit)

  1. Scaffold – genos init creates a repo with CI/CD YAML, Dockerfile, and test stubs.
  2. Prompt registry – genos prompt push stores the prompt version and triggers a synthetic‑eval job.
  3. Tool registration – genos tool register adds a REST or GraphQL endpoint to the tool registry, automatically generating OpenAPI‑compatible wrappers.
  4. Runtime packaging – genos build produces a container image with the selected model tier and guardrail plugins.
  5. Canary rollout – Deploy to 5 % of traffic; telemetry streams to the GenOS observability layer.
  6. Full rollout – After passing regression thresholds (hallucination < 2 %, latency < SLO), promote to 100 %.
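The last two steps amount to a traffic split plus a promotion gate. The sketch below shows one common way to implement both, assuming a hash‑based bucket for the 5 % split and a single latency number compared against the SLO; the article does not say how GenOS assigns canary traffic or which latency statistic it uses, so `in_canary`, `CanaryMetrics`, and `should_promote` are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class CanaryMetrics:
    hallucination_rate: float  # fraction of judged responses, 0.0 to 1.0
    latency_ms: float          # observed latency (the statistic is an assumption)


def in_canary(request_id: str, percent: int = 5) -> bool:
    """Deterministically place a request in one of 100 buckets."""
    digest = hashlib.sha256(request_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def should_promote(metrics: CanaryMetrics, latency_slo_ms: float,
                   max_hallucination_rate: float = 0.02) -> bool:
    """Promote only if both thresholds from the rollout step are met."""
    return (metrics.hallucination_rate < max_hallucination_rate
            and metrics.latency_ms < latency_slo_ms)
```

Hashing the request ID keeps the assignment stable, so a retried request lands on the same side of the split.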

The kit includes semantic caching (vector‑based lookup of recent LLM responses) to cut token usage by up to 30 % for repetitive queries.
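A semantic cache of this kind can be sketched as a nearest‑neighbour lookup over embeddings with a similarity cutoff. The version below is a deliberately small, in‑memory illustration: the caller supplies the embedding vectors, the 0.95 threshold is an arbitrary assumption, and a production cache would use a vector index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached response when a new query embedding is close enough."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self._entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self._entries.append((list(embedding), response))

    def get(self, embedding):
        best_score, best_response = 0.0, None
        for cached_embedding, response in self._entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        # A miss means the caller should invoke the model and then call put().
        return best_response if best_score >= self.threshold else None
```

The threshold trades token savings against the risk of serving an answer to a query that only looks similar, so it is worth tuning per use case.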


Real‑world implications

Scaling agent experiments

  • 8 000 developers have access to the platform; 1 300 actively build agents.
  • Over 3 500 production experiments have been launched, handling 450 k requests/day.
  • Token consumption peaked at 4 trillion in a single month, demonstrating that the cost‑control features (model tiering, semantic caching, fine‑tuning) are essential to keep budgets predictable.

Failure‑mode taxonomy

Intuit identified a set of recurring failure patterns for multi‑agent systems:

  • Wrong tool selection – the planner calls an API with mismatched parameters.
  • State drift – the agent forgets prior steps, leading to duplicated actions.
  • Hallucinated output – generated text contains fabricated data, triggering a compliance alert.
  • Infinite loops – the orchestration graph fails to reach a terminal state.

Each pattern is captured as a trace event in GenRuntime and fed to the LLM‑as‑judge evaluator. The evaluator produces a score that feeds back into the CI pipeline, automatically rejecting builds that exceed a configurable risk threshold.
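Two of these patterns can be detected mechanically from a trace before any judge model is involved. The sketch below assumes a trace is a list of dictionaries with `tool` and `args` keys, and that the judge returns a numeric risk score per trace; the event schema, the step budget, and the 0.3 threshold are illustrative assumptions, not GenRuntime's actual format.

```python
import json


def find_duplicate_actions(trace):
    """Return tool calls repeated with identical arguments (a state-drift signal)."""
    seen, duplicates = set(), []
    for event in trace:
        key = (event["tool"], json.dumps(event["args"], sort_keys=True))
        if key in seen:
            duplicates.append(event)
        seen.add(key)
    return duplicates


def exceeded_step_budget(trace, max_steps=25):
    """Flag runs that never reached a terminal state within the step budget."""
    return len(trace) > max_steps


def build_passes(risk_scores, threshold=0.3):
    """CI gate: reject the build if any evaluated trace exceeds the threshold."""
    return all(score <= threshold for score in risk_scores)
```

Cheap structural checks like these catch duplicated actions and runaway loops directly, leaving the judge to score the cases that need semantic understanding, such as hallucinated output.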

Governance and compliance

GenOS integrates with Intuit’s existing security stack:

  • Identity – OAuth 2.0 tokens are enriched with fine‑grained policy tags before reaching the LLM service.
  • Data privacy – All inbound documents are scanned for PII; flagged content is redacted before being passed to a model.
  • Audit – Every agent invocation logs the model version, prompt hash, and tool call payload to an immutable audit store (Amazon QLDB).
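The privacy and audit controls above can be illustrated with a redaction pass followed by the construction of an audit record. This is a rough sketch under stated assumptions: the two regular expressions cover only email addresses and US‑style SSNs and are far from a complete PII scanner, SHA‑256 is an assumed choice of hash, and the record is returned as a plain dictionary rather than written to any ledger.

```python
import hashlib
import json
import re
from datetime import datetime, timezone

# Illustrative patterns only; real PII detection needs much broader coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched PII with placeholders before text reaches a model."""
    text = EMAIL.sub("[REDACTED_EMAIL]", text)
    return SSN.sub("[REDACTED_SSN]", text)


def audit_record(model_version: str, prompt: str, tool_payload: dict) -> dict:
    """Build the fields the audit bullet lists: model version, prompt hash, payload."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "tool_call_payload": json.dumps(tool_payload, sort_keys=True),
    }
```

Storing a hash instead of the prompt text lets an auditor confirm which prompt version ran without copying prompt contents into the log.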

The platform participates in NIST AI Risk Management Framework work items. Because guardrails are part of the fixed layer that every agent inherits, new ones can be rolled out across all agents with a single configuration change.

Operational considerations

  • Latency budgeting – Because reasoning models can take minutes, GenOS splits a request into two parallel paths: a fast‑path for UI‑responsive feedback (using the lightweight tier) and a background reasoning path that updates the UI when complete.
  • Scaling semantics – Semantic caching reduces repeated RAG lookups; the cache is sharded by vector hash to avoid hot‑spotting.
  • Observability – Custom dashboards aggregate per‑agent KPIs: success rate, token cost, average latency, and hallucination score. Alerts trigger automatically when any KPI deviates > 10 % from the historical baseline.
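The latency‑budgeting pattern, a fast acknowledgement followed by a slower final answer, maps naturally onto concurrent tasks. The sketch below uses Python's asyncio with sleeps standing in for model calls; the function names, the delays, and the `update_ui` callback are assumptions, and a real system would push updates over a streaming channel instead of appending to a list.

```python
import asyncio


async def fast_path(query: str) -> str:
    # Stand-in for a lightweight-tier call that returns quickly.
    await asyncio.sleep(0.01)
    return f"Working on: {query}"


async def reasoning_path(query: str) -> str:
    # Stand-in for a slower reasoning-tier call.
    await asyncio.sleep(0.05)
    return f"Final answer for: {query}"


async def handle_request(query: str, update_ui) -> None:
    # Start the slow path first so both paths run concurrently.
    slow_task = asyncio.create_task(reasoning_path(query))
    update_ui(await fast_path(query))   # immediate, UI-responsive feedback
    update_ui(await slow_task)          # UI is updated again when reasoning ends


def run_demo(query: str) -> list:
    updates = []
    asyncio.run(handle_request(query, updates.append))
    return updates
```

Calling `run_demo("categorize receipts")` yields two updates in order: the quick acknowledgement, then the final answer.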

Preparing your own GenAI stack

  1. Treat prompts as first‑class artifacts – store them in a versioned registry, run automated regression tests on each change.
  2. Instrument every tool call – capture input, output, and latency; feed the data into a continuous evaluation pipeline.
  3. Adopt a “fixed‑flexible‑free” mindset – lock down security/compliance early (fixed), expose a curated set of model families and tool adapters (flexible), and provide self‑service pipelines (free).
  4. Invest in multimodal UX – agents that can accept images, PDFs, or voice reduce friction for end users and increase the value of the underlying LLM.
  5. Plan for incremental rollout – start with 5 % canary, monitor guardrail metrics, then expand. This mitigates the risk of sudden traffic spikes that can overwhelm legacy APIs.
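Step 1 is the easiest to prototype. The sketch below is a minimal in‑memory prompt registry with append‑only versions and rollback, assuming 1‑based version numbers; the class and method names are hypothetical, and a real registry would persist versions and trigger regression tests on each push.

```python
class PromptRegistry:
    """Append-only prompt store with an 'active version' pointer per prompt."""

    def __init__(self):
        self._versions = {}  # prompt name -> list of prompt texts
        self._active = {}    # prompt name -> active version (1-based)

    def push(self, name: str, text: str) -> int:
        """Store a new version, make it active, and return its number."""
        versions = self._versions.setdefault(name, [])
        versions.append(text)
        self._active[name] = len(versions)
        return len(versions)

    def get(self, name: str, version=None) -> str:
        """Fetch a specific version, or the active one by default."""
        chosen = self._active[name] if version is None else version
        return self._versions[name][chosen - 1]

    def rollback(self, name: str, version: int) -> None:
        """Point the active version at any prior revision without deleting history."""
        if not 1 <= version <= len(self._versions.get(name, [])):
            raise KeyError(f"{name!r} has no version {version}")
        self._active[name] = version
```

Because rollback only moves a pointer, later versions stay available for comparison and for rerunning regression suites.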

Conclusion

Intuit’s GenOS demonstrates that a disciplined platform approach, one that combines a robust runtime, self‑service developer tooling, and rigorous governance, can turn a chaotic wave of generative‑AI experiments into a predictable, cost‑controlled production service. The “fixed, flexible, free” framework provides a template for any enterprise looking to move from isolated proof‑of‑concepts to a fleet of reliable AI agents.


For the full slide deck and transcript, see the official InfoQ presentation page.
