Designing a Multi‑Agent System for Engineering Support at Scale – Grab Case Study
#DevOps

Infrastructure Reporter

Grab’s Analytics Data Warehouse team built a LangGraph‑driven multi‑agent platform that automates routine data‑platform support tasks. By separating investigation and enhancement workflows, curating a minimal toolset, and enforcing safety guards, the system cuts hundreds of engineering hours per month while keeping human oversight on code changes.

Technical announcement

Grab's Analytics Data Warehouse (ADW) team has shipped a production-grade multi-agent AI platform that automates internal engineering support requests for a data-warehouse environment with more than 1,000 users and 15,000+ tables. The system replaces manual query debugging, log hunting, and schema look-ups with a set of tightly scoped agents, orchestrated by a LangGraph workflow engine and exposed through FastAPI endpoints. The primary goal is to reclaim engineering capacity for platform-level work rather than repetitive firefighting.


Specifications

| Component | Technology | Key parameters |
| --- | --- | --- |
| Orchestration | LangGraph (v0.4) + FastAPI (v0.112) | Stateless graph execution, < 200 ms per node, horizontal scaling via Uvicorn workers |
| Agent runtime | Python 3.11, OpenAI gpt-4o-mini (temperature 0.0) | 8 k token context limit, per-request cost ≈ $0.0015 |
| Tool layer | Curated set of 12 internal services (SQL executor, metadata API, log service, GitOps client, etc.) | Each tool exposed through a JSON-RPC wrapper, rate-limited to 100 rpm per tool |
| State store | Redis Cluster (6 nodes, 32 GB RAM) | 2 ms read/write latency, TTL = 12 h for session state |
| Safety checks | Custom validation middleware, OPA policies for SQL, secret redaction | Rejects any query that touches PII tables, enforces least-privilege DB role |
| Human-in-the-loop | Review UI built on React + Ant Design, integrated with the GitHub PR flow | All enhancement outputs generate a draft PR that must be approved before merge |
| Deployment | Kubernetes 1.28 (EKS) | 6-node pool, 8 vCPU / 32 GB per pod, autoscaling based on queue depth |
| Observability | OpenTelemetry (traces, metrics, logs) → Grafana Loki + Prometheus | n/a |
| Throughput | n/a | Avg. 45 req/min peak, 99.9 % SLA for response < 2 s |
| Cost | n/a | Approx. $12 k/month for cloud resources + LLM usage |

Architecture diagram (excerpt)

[Figure: architecture diagram from "Designing a Multi-Agent System for Engineering Support at Scale: A Case Study From Grab", InfoQ]

The workflow proceeds as follows:

  1. Request ingestion – An internal ticket (Slack, ServiceNow, or custom UI) is posted to the FastAPI gateway.
  2. Classification – A lightweight LLM classifier routes the request to either the Investigation or Enhancement sub‑graph.
  3. Supervisor node – Coordinates agent selection, enforces tool‑access policies, and maintains a bounded context store in Redis.
  4. Agent execution – Each agent runs a single responsibility (e.g., SQLExecutor, LogRetriever, CodeSearcher). Agents receive only the subset of context they need, reducing token usage.
  5. Result aggregation – The supervisor merges partial outputs, applies a safety filter, and either returns a diagnostic summary (investigation) or creates a draft pull request (enhancement).
  6. Human review – Engineers review the PR, optionally add comments, then merge. The system logs the decision for future reinforcement learning.
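
The six steps above can be sketched in plain Python to make the control flow concrete. This is a minimal illustration under stated assumptions, not Grab's code: it uses ordinary functions in place of LangGraph nodes, a keyword rule in place of the LLM classifier, and the names `TicketState`, `AGENTS_BY_ROUTE`, and `handle_ticket` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TicketState:
    """State handed from node to node; field names are illustrative."""
    text: str
    route: str = ""
    findings: list[str] = field(default_factory=list)
    output: dict = field(default_factory=dict)


def classify(state: TicketState) -> TicketState:
    # Stand-in for the lightweight LLM classifier: a simple keyword rule.
    words = set(state.text.lower().split())
    wants_change = bool(words & {"add", "fix", "change"})
    state.route = "enhancement" if wants_change else "investigation"
    return state


def log_retriever(state: TicketState) -> None:
    # Placeholder agent: a real one would call the log service tool.
    state.findings.append("log_retriever: placeholder log excerpt")


def code_searcher(state: TicketState) -> None:
    # Placeholder agent: a real one would search the code repository.
    state.findings.append("code_searcher: placeholder file match")


# Hypothetical mapping from sub-graph to its single-responsibility agents.
AGENTS_BY_ROUTE: dict[str, list[Callable[[TicketState], None]]] = {
    "investigation": [log_retriever],
    "enhancement": [code_searcher],
}


def supervisor(state: TicketState) -> TicketState:
    # Run only the agents allowed for this route, then aggregate.
    for agent in AGENTS_BY_ROUTE[state.route]:
        agent(state)
    summary = "; ".join(state.findings)
    if state.route == "investigation":
        state.output = {"type": "diagnostic_summary", "body": summary}
    else:
        # Enhancements are never merged directly; they become a draft PR.
        state.output = {"type": "draft_pull_request", "body": summary, "merged": False}
    return state


def handle_ticket(text: str) -> TicketState:
    return supervisor(classify(TicketState(text=text)))
```

Calling `handle_ticket("Why did the nightly job fail?")` follows the investigation path, while a request containing "add", "fix", or "change" produces an unmerged draft pull request that still needs human approval.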

Benchmarks

| Test | Avg. latency | 95th-pct latency | Success rate |
| --- | --- | --- | --- |
| Investigation – query analysis (10 k-row table) | 1.2 s | 1.8 s | 99.4 % |
| Enhancement – automated SQL fix (synthetic typo) | 1.6 s | 2.3 s | 98.7 % |
| End-to-end ticket handling (mixed workload) | 1.9 s | 2.7 s | 99.1 % |

The team measured a 30 % reduction in mean time to resolution (MTTR) for routine tickets and estimates that roughly 250 engineer-hours are saved per month, based on ticket volume before and after deployment.


Real‑world implications and deployment considerations

1. Tool‑set curation

Grab started with >30 internal tools but found that agents frequently selected the wrong tool, causing nondeterministic behavior. By collapsing the surface area to 12 well‑documented services and exposing them through a uniform JSON‑RPC schema, they achieved:

  • Predictable latency (each call < 50 ms)
  • Easier policy enforcement (OPA rules apply per‑tool)
  • Simpler debugging – logs now contain a single tool_id field.

Takeaway: When designing a multi‑agent system, begin with a minimal viable toolset. Incrementally add new tools only after the existing set proves stable.
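
To illustrate what a uniform JSON-RPC tool surface might look like, here is a small sketch of a registry that dispatches JSON-RPC 2.0 requests to registered tools. The `ToolRegistry` class and the `metadata.describe_table` tool are hypothetical names chosen for this example; the article does not describe Grab's actual schema.

```python
import json
from typing import Any, Callable


class ToolRegistry:
    """Dispatches JSON-RPC 2.0 requests to a small, explicit set of tools."""

    def __init__(self) -> None:
        self._tools: dict[str, Callable[..., Any]] = {}

    def register(self, tool_id: str, fn: Callable[..., Any]) -> None:
        self._tools[tool_id] = fn

    def handle(self, raw_request: str) -> str:
        req = json.loads(raw_request)
        tool_id = req.get("method")
        if tool_id not in self._tools:
            # -32601 is the JSON-RPC 2.0 "Method not found" error code.
            resp = {
                "jsonrpc": "2.0",
                "id": req.get("id"),
                "error": {"code": -32601, "message": f"unknown tool: {tool_id}"},
            }
        else:
            result = self._tools[tool_id](**req.get("params", {}))
            resp = {"jsonrpc": "2.0", "id": req.get("id"), "result": result}
        return json.dumps(resp)


registry = ToolRegistry()
# Hypothetical tool: returns canned metadata instead of calling a real API.
registry.register(
    "metadata.describe_table",
    lambda table: {"table": table, "columns": ["id", "amount"]},
)

ok_reply = registry.handle(json.dumps({
    "jsonrpc": "2.0", "id": 1,
    "method": "metadata.describe_table", "params": {"table": "orders"},
}))
bad_reply = registry.handle(json.dumps({
    "jsonrpc": "2.0", "id": 2, "method": "not.a.tool", "params": {},
}))
```

Because every call passes through one `handle` method, the method name doubles as the single `tool_id` that logging and per-tool policies can key on, and an agent that picks a tool outside the curated set gets an explicit error rather than undefined behavior.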

2. Context management under token limits

Agents share a Redis‑backed context buffer that stores compressed summaries of prior steps. The compression uses a lightweight extract‑summarize model (OpenAI gpt‑4o‑mini with a 256‑token output) after each node. This approach keeps the active context under the 8 k token ceiling while preserving essential facts such as table names, error codes, and log snippets.

Best practice: Implement a deterministic compression policy (e.g., keep last N steps, summarize older steps) rather than relying on the LLM to truncate automatically.
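
A deterministic policy of this kind could look like the sketch below: keep the last N steps verbatim and summarize everything older. The `summarize` function here is a plain truncation used as a stand-in for the LLM summarizer, and `keep_last` and `max_chars` are assumed parameters, not values from the article.

```python
def summarize(step: str, max_chars: int = 40) -> str:
    # Stand-in for the LLM summarizer: truncate to a fixed character budget.
    if len(step) <= max_chars:
        return step
    return step[: max_chars - 3] + "..."


def compress_context(steps: list[str], keep_last: int = 3) -> list[str]:
    """Keep the most recent `keep_last` steps verbatim; summarize older ones."""
    split = max(len(steps) - keep_last, 0)
    older, recent = steps[:split], steps[split:]
    return [summarize(s) for s in older] + recent


history = ["x" * 100, "y" * 100, "short note", "z" * 100]
compressed = compress_context(history, keep_last=2)
```

The same input always yields the same compressed context, which makes token usage predictable and agent runs easier to reproduce when debugging.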

3. Safety and governance

  • SQL validation: A pre-execution sandbox forces any DML/DDL statement into read-only mode unless the request originates from an approved service account.
  • PII detection: Regular expressions and a data‑catalog lookup flag any column marked as sensitive; the agent redacts those fields before returning results.
  • Human‑in‑the‑loop: All enhancement outputs generate a draft PR. The PR includes a signed artifact (agent‑run‑id) that can be audited later.

These controls satisfy internal compliance requirements and mitigate the risk of accidental data exposure.
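
The first two controls can be approximated with a short sketch. It makes several assumptions: it rejects write statements rather than rewriting them as the sandbox described above does, it uses a first-keyword check where a production system would use a SQL parser and OPA policies, and the `SENSITIVE_COLUMNS` set stands in for a data-catalog lookup.

```python
import re

# Deliberately narrow allow-list; anything else is treated as a write.
READ_ONLY_KEYWORDS = ("select", "show", "describe")
# Hypothetical columns that a data catalog has flagged as sensitive.
SENSITIVE_COLUMNS = {"email", "phone_number"}


def is_read_only(sql: str) -> bool:
    """True if `sql` is a single statement starting with a read-only keyword."""
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:
        # Conservatively reject stacked statements such as "SELECT 1; DROP ...".
        return False
    first = re.match(r"[A-Za-z]+", statement)
    return bool(first) and first.group(0).lower() in READ_ONLY_KEYWORDS


def guard(sql: str, approved_service_account: bool = False) -> str:
    """Return the SQL unchanged if allowed, otherwise raise PermissionError."""
    if approved_service_account or is_read_only(sql):
        return sql
    raise PermissionError("write statement rejected for non-approved caller")


def redact(row: dict) -> dict:
    """Replace values of sensitive columns before results leave the agent."""
    return {
        key: ("[REDACTED]" if key.lower() in SENSITIVE_COLUMNS else value)
        for key, value in row.items()
    }
```

A keyword allow-list errs on the side of rejecting valid queries (for example, CTEs that start with `WITH`), which is usually the safer failure mode for an automated agent.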

4. Scaling on Kubernetes

The system runs as a set of stateless pods behind a RabbitMQ queue. Autoscaling thresholds are tuned to the queue depth:

  • Scale‑out at > 100 pending requests → add 2 pods.
  • Scale-in when queue < 20 for 5 min → remove 1 pod.

The Redis cluster is provisioned with a replica set to survive node failures, and OpenTelemetry traces allow the ops team to spot latency spikes in specific agents.
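
The queue-depth thresholds described in this section reduce to a small decision function. The sketch below encodes only those two rules; the function names and the `min_replicas` floor are assumptions added for illustration, and a real deployment would express this through its autoscaler configuration rather than application code.

```python
def desired_pod_delta(queue_depth: int, seconds_below_low: float) -> int:
    """Return how many pods to add (positive) or remove (negative).

    queue_depth: pending requests currently in the queue.
    seconds_below_low: how long the queue has stayed under 20 requests.
    """
    if queue_depth > 100:
        return 2  # scale out: add 2 pods
    if queue_depth < 20 and seconds_below_low >= 5 * 60:
        return -1  # scale in: remove 1 pod after 5 quiet minutes
    return 0  # otherwise hold steady


def next_replicas(current: int, queue_depth: int, seconds_below_low: float,
                  min_replicas: int = 1) -> int:
    # min_replicas is an assumed floor so the service never scales to zero.
    return max(current + desired_pod_delta(queue_depth, seconds_below_low), min_replicas)
```

The asymmetry (add two pods quickly, remove one slowly) biases the system toward spare capacity, which reduces the chance of oscillating between scale-out and scale-in during bursty ticket traffic.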

5. Observability and debugging

Each agent emits structured logs (agent_name, tool_id, duration_ms, status). Grafana dashboards show per‑agent latency heatmaps, enabling quick identification of bottlenecks (e.g., a slow LogRetriever that hit a rate limit). Alerts fire on error‑rate > 2 % for any agent, prompting a rollback to the previous stable graph version.
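
As a rough illustration of this logging and alerting scheme, the sketch below emits records with the four fields named above and computes which agents exceed the 2 % error-rate threshold. The function names and the `"ok"` status value are assumptions; in the described system this aggregation would happen in the metrics backend rather than in Python.

```python
import json
from collections import defaultdict


def log_record(agent_name: str, tool_id: str, duration_ms: float, status: str) -> str:
    """Serialize one structured log line with the fields named in the article."""
    return json.dumps(
        {"agent_name": agent_name, "tool_id": tool_id,
         "duration_ms": duration_ms, "status": status},
        sort_keys=True,
    )


def agents_over_threshold(records: list[dict], threshold: float = 0.02) -> list[str]:
    """Return agents whose error rate is strictly above `threshold`."""
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for rec in records:
        totals[rec["agent_name"]] += 1
        if rec["status"] != "ok":  # assumed convention: anything but "ok" is an error
            errors[rec["agent_name"]] += 1
    return sorted(a for a in totals if errors[a] / totals[a] > threshold)


# Synthetic records: LogRetriever fails 3 of 100 calls, SQLExecutor 1 of 100.
records = (
    [json.loads(log_record("LogRetriever", "log_service", 40, "ok"))] * 97
    + [json.loads(log_record("LogRetriever", "log_service", 900, "rate_limited"))] * 3
    + [json.loads(log_record("SQLExecutor", "sql_executor", 30, "ok"))] * 99
    + [json.loads(log_record("SQLExecutor", "sql_executor", 30, "error"))] * 1
)
alerting = agents_over_threshold(records)
```

With this synthetic data only `LogRetriever` crosses the threshold, which is the kind of signal that would prompt a rollback to the previous stable graph version.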


Conclusion

Grab’s multi‑agent platform demonstrates how a disciplined architecture—clear workflow separation, a curated tool layer, strict safety gates, and robust observability—can turn a noisy support backlog into an automated, auditable service. The measured reduction in MTTR and the reclaimed engineering capacity illustrate the tangible ROI of moving repetitive data‑platform tasks into a controlled AI‑driven pipeline.

For teams looking to adopt a similar approach, the key steps are:

  1. Map existing support tasks to discrete agent responsibilities.
  2. Build a minimal, well‑documented tool API surface.
  3. Deploy a graph‑oriented orchestrator (LangGraph, Airflow, or Temporal) that can enforce state limits.
  4. Layer safety checks and mandatory human review for any code‑changing actions.
  5. Instrument end‑to‑end tracing from request ingress to final output.

Following this pattern should enable other large‑scale data platforms to shift from reactive firefighting to proactive system evolution.


References

Designing a Multi-Agent System for Engineering Support at Scale: A Case Study From Grab - InfoQ
