The AI Gateway: How Centralized Inference Keeps Decentralized Teams Productive

Meryem Arik explains why modern enterprises face “inference chaos” and how AI model gateways—open‑source tools like LiteLLM, Doubleword, Portkey, and Bifrost—provide a lightweight control layer for security, cost, and governance while still empowering teams to pick the best models for their workloads.

Technical announcement

At QCon AI 2026, Doubleword co‑founder Meryem Arik unveiled a practical blueprint for taming the growing mess of AI inference in large organizations. Her talk, The AI Gateway: Scaling Centralized Inference Across Decentralized Teams, showed how a thin, high‑performance AI model gateway can act as the single point of control for dozens of model providers—OpenAI, Anthropic, Mistral, self‑hosted vLLM instances, and even emerging ASIC‑based services—while still letting each product team choose the model that best fits its latency, cost, and data‑residency constraints.

Specifications

Feature	Typical implementation	Open‑source options
Unified API	Normalizes OpenAI‑style JSON across providers	LiteLLM, Doubleword, Portkey, Bifrost, OpenRouter
Authentication & RBAC	JWT/OAuth2, group‑based policies, SSO integration (Entra ID, Okta)	All listed gateways support SSO; self‑hosted binaries let you bind to your corporate IdP
Request logging & audit	Structured logs (JSON) sent to ELK/Datadog; immutable storage for compliance	Built‑in log export in LiteLLM, Portkey; custom plug‑ins for any gateway
Model‑aware routing	Route by latency, cost, region, or request difficulty	Portkey’s smart router, Doubleword’s rule engine, Bifrost’s weight‑based routing
Rate‑limit & budget enforcement	Per‑group token caps, spend caps, burst limits	LiteLLM’s budget feature, Doubleword’s quota service
Guardrails / content filters	Optional pre‑processing, post‑processing hooks	Portkey includes guardrails; others expose hook points for custom filters
Latency budget	Target < 5 ms overhead on a 200 RPS workload (see benchmark below)	Doubleword (Rust implementation) ~1 ms; LiteLLM ~3‑4 ms; Bifrost ~2 ms

Benchmark snapshot (single‑node, 32‑core, 2 × NVIDIA A100)

Throughput: Doubleword 210 req/s, Bifrost 190 req/s, LiteLLM 60 req/s (older version)
Average added latency: Doubleword 0.9 ms, Bifrost 1.2 ms, LiteLLM 3.8 ms
99th‑percentile latency: < 2 ms for Doubleword, < 3 ms for Bifrost

These numbers demonstrate that a well‑engineered gateway can be invisible to downstream services, a key requirement when latency‑sensitive use cases such as code‑completion or real‑time chat assistants are involved.

Real‑world implications

1. Empowering decentralized product teams

Each team can query a model catalog exposed by the gateway. The catalog lists every approved model together with metadata:

Quality tier (Elo score, domain‑specific benchmark)
Compliance tags (EU‑residency, HIPAA, PII‑safe)
Cost per token and latency SLA

A data‑labeling team can pick a cheap, fine‑tuned vision model hosted on a private GPU cluster, while a coding‑assistant team selects a low‑latency Opus model for instant autocompletion. Because the gateway normalizes request payloads, swapping models requires only a configuration change—no code rewrite.

2. Centralizing governance without throttling innovation

The gateway enforces:

RBAC: Only the “clinical‑analytics” group can call the proprietary cancer‑screening model.
Spend caps: Interns receive a $15 budget; mission‑critical services get unlimited quotas.
Data‑handling policies: Requests flagged as PII are automatically routed to models running in the EU region and logged for deletion after 30 days.

All actions are recorded in an immutable audit log, enabling post‑mortem cost analysis and regulatory reporting.

3. Optimizing GPU utilization for self‑hosted fleets

When teams independently spin up identical models on separate GPU nodes, utilization drops dramatically (often < 20 %). By funneling all inference through a single gateway, the platform can:

Load‑balance across a shared pool of GPUs, achieving > 80 % average utilization.
Burst‑scale: During a traffic spike, the gateway can spill over to a cloud provider (e.g., OpenAI) while keeping the primary workload on‑prem.
Failover: If a provider experiences downtime, routing rules automatically switch to a backup model, preserving SLA.

4. Extending the gateway to agents and MCP servers

Meryem highlighted that the next generation of gateways will also mediate agentic workloads (LLM‑driven planners, tool‑calling agents) and MCP (model‑compute‑platform) servers. The same control plane—authentication, budgeting, routing—applies, meaning you can register an entire agent pipeline as a single logical model endpoint.

5. Deployment considerations

Aspect	Recommendation
Hosting	Deploy the gateway as a containerized service behind your internal load balancer; use a side‑car for TLS termination.
State	Keep the gateway stateless; store quotas and logs in an external Redis/SQL store.
Observability	Export Prometheus metrics (`gateway_requests_total`, `gateway_latency_seconds`) and forward logs to your existing SIEM.
Scalability	Horizontal pod autoscaling based on request‑per‑second metrics; each instance can handle ~200 RPS on modest hardware.
Security	Enable mutual TLS, enforce short‑lived API tokens, and integrate with Azure Entra ID or Okta for group sync.

Getting started

Pick a gateway – for a quick proof‑of‑concept, spin up LiteLLM with Docker (docker run -p 4000:4000 litellm/litellm).
Define your model catalog – create a YAML file mapping logical names (code‑assistant, image‑labeler) to provider endpoints and credentials.
Configure RBAC – import Azure AD groups via the gateway’s authz module; map each group to a budget.
Deploy – run the gateway in a Kubernetes namespace dedicated to AI infra; expose it via an internal ClusterIP service.
Migrate – update each product team’s client libraries to point at https://gateway.internal.company/v1/chat/completions.

Open‑source resources

LiteLLM – https://github.com/BerriAI/litellm
Doubleword (Meryem’s gateway) – https://github.com/doubleword/ai-gateway
Portkey – https://github.com/portkey-ai/portkey
Bifrost – https://github.com/bifrost-ai/bifrost
OpenRouter – https://openrouter.ai

Takeaways

Decentralized teams need freedom – they must be able to select the model that matches their latency, cost, and domain requirements.
Centralized inference is essential – it provides governance, cost control, and optimal GPU utilization.
AI model gateways are the pragmatic tool – lightweight, open‑source, and easy to deploy, they let you keep the chaos out of production while still encouraging rapid innovation.

For a deeper dive, read the blog that inspired this talk: https://fergusfinn.com/blog/control-layer/.

#AI #Infrastructure #Open Source #Model gateways #cost control