Meryem Arik explains why modern enterprises face “inference chaos” and how AI model gateways—open‑source tools like LiteLLM, Doubleword, Portkey, and Bifrost—provide a lightweight control layer for security, cost, and governance while still empowering teams to pick the best models for their workloads.
Technical announcement
At QCon AI 2026, Doubleword co‑founder Meryem Arik unveiled a practical blueprint for taming the growing mess of AI inference in large organizations. Her talk, The AI Gateway: Scaling Centralized Inference Across Decentralized Teams, showed how a thin, high‑performance AI model gateway can act as the single point of control for dozens of model providers—OpenAI, Anthropic, Mistral, self‑hosted vLLM instances, and even emerging ASIC‑based services—while still letting each product team choose the model that best fits its latency, cost, and data‑residency constraints.

Specifications
| Feature | Typical implementation | Open‑source options |
|---|---|---|
| Unified API | Normalizes OpenAI‑style JSON across providers | LiteLLM, Doubleword, Portkey, Bifrost, OpenRouter |
| Authentication & RBAC | JWT/OAuth2, group‑based policies, SSO integration (Entra ID, Okta) | All listed gateways support SSO; self‑hosted binaries let you bind to your corporate IdP |
| Request logging & audit | Structured logs (JSON) sent to ELK/Datadog; immutable storage for compliance | Built‑in log export in LiteLLM, Portkey; custom plug‑ins for any gateway |
| Model‑aware routing | Route by latency, cost, region, or request difficulty | Portkey’s smart router, Doubleword’s rule engine, Bifrost’s weight‑based routing |
| Rate‑limit & budget enforcement | Per‑group token caps, spend caps, burst limits | LiteLLM’s budget feature, Doubleword’s quota service |
| Guardrails / content filters | Optional pre‑processing, post‑processing hooks | Portkey includes guardrails; others expose hook points for custom filters |
| Latency budget | Target < 5 ms overhead on a 200 RPS workload (see benchmark below) | Doubleword (Rust implementation) ~1 ms; LiteLLM ~3‑4 ms; Bifrost ~2 ms |
Benchmark snapshot (single‑node, 32‑core, 2 × NVIDIA A100)
- Throughput: Doubleword 210 req/s, Bifrost 190 req/s, LiteLLM 60 req/s (older version)
- Average added latency: Doubleword 0.9 ms, Bifrost 1.2 ms, LiteLLM 3.8 ms
- 99th‑percentile latency: < 2 ms for Doubleword, < 3 ms for Bifrost
These numbers demonstrate that a well‑engineered gateway can be invisible to downstream services, a key requirement when latency‑sensitive use cases such as code‑completion or real‑time chat assistants are involved.
Real‑world implications
1. Empowering decentralized product teams
Each team can query a model catalog exposed by the gateway. The catalog lists every approved model together with metadata:
- Quality tier (Elo score, domain‑specific benchmark)
- Compliance tags (EU‑residency, HIPAA, PII‑safe)
- Cost per token and latency SLA
A data‑labeling team can pick a cheap, fine‑tuned vision model hosted on a private GPU cluster, while a coding‑assistant team selects a low‑latency Opus model for instant autocompletion. Because the gateway normalizes request payloads, swapping models requires only a configuration change—no code rewrite.
2. Centralizing governance without throttling innovation
The gateway enforces:
- RBAC: Only the “clinical‑analytics” group can call the proprietary cancer‑screening model.
- Spend caps: Interns receive a $15 budget; mission‑critical services get unlimited quotas.
- Data‑handling policies: Requests flagged as PII are automatically routed to models running in the EU region and logged for deletion after 30 days.
All actions are recorded in an immutable audit log, enabling post‑mortem cost analysis and regulatory reporting.
3. Optimizing GPU utilization for self‑hosted fleets
When teams independently spin up identical models on separate GPU nodes, utilization drops dramatically (often < 20 %). By funneling all inference through a single gateway, the platform can:
- Load‑balance across a shared pool of GPUs, achieving > 80 % average utilization.
- Burst‑scale: During a traffic spike, the gateway can spill over to a cloud provider (e.g., OpenAI) while keeping the primary workload on‑prem.
- Failover: If a provider experiences downtime, routing rules automatically switch to a backup model, preserving SLA.
4. Extending the gateway to agents and MCP servers
Meryem highlighted that the next generation of gateways will also mediate agentic workloads (LLM‑driven planners, tool‑calling agents) and MCP (model‑compute‑platform) servers. The same control plane—authentication, budgeting, routing—applies, meaning you can register an entire agent pipeline as a single logical model endpoint.
5. Deployment considerations
| Aspect | Recommendation |
|---|---|
| Hosting | Deploy the gateway as a containerized service behind your internal load balancer; use a side‑car for TLS termination. |
| State | Keep the gateway stateless; store quotas and logs in an external Redis/SQL store. |
| Observability | Export Prometheus metrics (gateway_requests_total, gateway_latency_seconds) and forward logs to your existing SIEM. |
| Scalability | Horizontal pod autoscaling based on request‑per‑second metrics; each instance can handle ~200 RPS on modest hardware. |
| Security | Enable mutual TLS, enforce short‑lived API tokens, and integrate with Azure Entra ID or Okta for group sync. |
Getting started
- Pick a gateway – for a quick proof‑of‑concept, spin up LiteLLM with Docker (
docker run -p 4000:4000 litellm/litellm). - Define your model catalog – create a YAML file mapping logical names (
code‑assistant,image‑labeler) to provider endpoints and credentials. - Configure RBAC – import Azure AD groups via the gateway’s
authzmodule; map each group to a budget. - Deploy – run the gateway in a Kubernetes namespace dedicated to AI infra; expose it via an internal
ClusterIPservice. - Migrate – update each product team’s client libraries to point at
https://gateway.internal.company/v1/chat/completions.
Open‑source resources
- LiteLLM – https://github.com/BerriAI/litellm
- Doubleword (Meryem’s gateway) – https://github.com/doubleword/ai-gateway
- Portkey – https://github.com/portkey-ai/portkey
- Bifrost – https://github.com/bifrost-ai/bifrost
- OpenRouter – https://openrouter.ai
Takeaways
- Decentralized teams need freedom – they must be able to select the model that matches their latency, cost, and domain requirements.
- Centralized inference is essential – it provides governance, cost control, and optimal GPU utilization.
- AI model gateways are the pragmatic tool – lightweight, open‑source, and easy to deploy, they let you keep the chaos out of production while still encouraging rapid innovation.
For a deeper dive, read the blog that inspired this talk: https://fergusfinn.com/blog/control-layer/.

Comments
Please log in or register to join the discussion