Lessons from Building Deep Research Agents in Production – A Technical Deep‑Dive | LavX News

Sarang Kulkarni (ThoughtWorks) shares the architecture, tooling, and operational lessons learned while deploying an Agentic RAG++ system for drug‑discovery research. The article breaks down the multi‑loop design, performance trade‑offs, failure modes, and emerging “harness engineering” practices that make autonomous AI agents reliable in regulated environments.

Technical announcement

At the Arc of AI Conference 2026, ThoughtWorks senior architect Sarang Kulkarni presented the production‑grade Deep Research Agentic RAG++ system that powers multi‑step, internet‑scale research for pharmaceutical R&D teams. The talk covered the end‑to‑end pipeline, from a Retrieval‑Augmented Generation (RAG) chatbot prototype to a fully autonomous agent capable of planning, executing, reflecting, and drafting regulatory‑grade reports. The solution is now running on a hybrid cloud fleet serving dozens of concurrent drug‑discovery projects, handling queries that span internal clinical trial data, scientific literature, and public biomedical repositories.

Specifications

Component	Technology stack	Key parameters
LLM core	Anthropic Claude‑3.5 Sonnet (API)	Temperature 0.2, max‑tokens 8 k (Anthropic does not publish a parameter count)
Retrieval layer	Hybrid weighted search (BM25 + dense embeddings) using FAISS + Elasticsearch 8.12	20 initial context chunks → re‑rank → 7 refined chunks
Text‑to‑SQL tool	Custom LLM‑driven parser + PostgreSQL 15	Error‑feedback loop, query latency < 150 ms
Agentic loop engine	LangChain 0.2 orchestrating think → plan → act → reflect steps	Supports up to 12 sub‑steps per request
Memory store	Redis‑AI with vector persistence, TTL 48 h	2 GB per agent instance
Observability	OpenTelemetry + Prometheus + Grafana dashboards	Latency, token‑usage, retry counts per loop
Compliance guardrails	Policy‑as‑code (OPA) + audit‑log sink (AWS CloudTrail)	Data residency: EU‑West‑1 only
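
The retrieval row in the table above describes a two-stage funnel: a weighted blend of lexical and dense scores selects 20 candidates, and a re-ranker narrows them to 7. A minimal Python sketch of that idea follows. It assumes min-max normalised scores, a hypothetical blending weight `alpha`, and a caller-supplied `rerank` function; none of these names come from the talk, and the production system uses FAISS and Elasticsearch rather than in-memory dictionaries.

```python
from typing import Callable, Dict, List


def _normalise(scores: Dict[str, float]) -> Dict[str, float]:
    """Min-max scale scores to [0, 1] so BM25 and dense values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_retrieve(
    bm25_scores: Dict[str, float],
    dense_scores: Dict[str, float],
    rerank: Callable[[str], float],
    alpha: float = 0.5,   # assumed weight on the lexical (BM25) score
    initial_k: int = 20,  # "20 initial context chunks"
    final_k: int = 7,     # "7 refined chunks"
) -> List[str]:
    bm25 = _normalise(bm25_scores)
    dense = _normalise(dense_scores)
    # A chunk missing from one index contributes 0 for that signal.
    fused = {
        doc_id: alpha * bm25.get(doc_id, 0.0) + (1 - alpha) * dense.get(doc_id, 0.0)
        for doc_id in set(bm25) | set(dense)
    }
    candidates = sorted(fused, key=lambda d: fused[d], reverse=True)[:initial_k]
    # Second stage: the (usually more expensive) re-ranker scores only the candidates.
    return sorted(candidates, key=rerank, reverse=True)[:final_k]
```

Because the re-ranker only ever sees `initial_k` chunks, its cost stays bounded regardless of corpus size, which is one plausible reason for the 20 → 7 split.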

Benchmarks (internal)

Test case	Avg. latency (ms)	Tokens per query	Success rate
Simple fact lookup (single‑hop)	420	120	99.3 %
Multi‑hop literature synthesis (3‑step)	1 240	480	94.7 %
Full research loop (plan‑execute‑reflect‑write)	3 560	1 200	88.2 %
Draft‑write‑redraft cycle	2 130	820	91.5 %

The primary bottlenecks are LLM token latency and vector search re‑ranking. Kulkarni’s team mitigated these by caching top‑k embeddings per domain and pre‑warming the LLM with “think” prompts that prune irrelevant branches early.

Real‑world implications and deployment considerations

1. Failure‑mode taxonomy

Mode	Symptom	Mitigation
Context anxiety – the agent consumes more tokens than budgeted, leading to truncated answers.	Token‑usage spikes > 1.5 k per step.	Introduce a token‑budget guard that forces a re‑plan when the projected usage exceeds a threshold.
Incomplete data – missing fields in source PDFs cause hallucinations.	Reflections report “unknown”, but the downstream write loop still includes placeholder text.	Add a data‑completeness validator that triggers a secondary retrieval pass before the write loop.
Long‑horizon drift – planning diverges after several act steps.	Final report omits early‑stage hypotheses.	Insert an inspect step after each act to compare the current state against the original plan; if divergence exceeds 15 %, trigger a re‑plan.
SQL execution errors – malformed queries from the text2sql tool.	The DB returns a syntax error and the LLM repeats the same query.	Feed the error back into the LLM via a self‑correction prompt; limit retries to three attempts.
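
The SQL mitigation in the last row can be sketched as a bounded retry loop. The version below is illustrative only: it uses Python's built-in `sqlite3` as a stand-in for PostgreSQL, and `generate_sql` is a hypothetical callable representing the LLM-driven text2sql tool, which receives the question plus the previous error message (or `None` on the first attempt).

```python
import sqlite3
from typing import Callable, List, Optional, Tuple

MAX_ATTEMPTS = 3  # "limit retries to three attempts"


def run_with_self_correction(
    conn: sqlite3.Connection,
    question: str,
    generate_sql: Callable[[str, Optional[str]], str],
) -> List[Tuple]:
    """Ask the model for SQL, execute it, and feed any DB error back in."""
    feedback: Optional[str] = None
    for _ in range(MAX_ATTEMPTS):
        query = generate_sql(question, feedback)
        try:
            return conn.execute(query).fetchall()
        except sqlite3.Error as exc:
            # Include the failing query so the model can see what not to repeat.
            feedback = f"Query failed: {query!r}. Database error: {exc}"
    raise RuntimeError(f"No valid SQL after {MAX_ATTEMPTS} attempts. {feedback}")
```

Raising after the third failure, rather than returning an empty result, keeps a broken query from being mistaken for "no matching rows" by the downstream write loop.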

2. Harness engineering

Kulkarni used the term “harness engineering” to describe the systematic construction of:

Tool wrappers (RAG, text2sql, web‑scraper) that expose a stable API and enforce input validation.
Memory primitives – short‑term scratchpad (Redis list) and long‑term vector store (FAISS) with explicit TTLs.
Validation checks – OPA policies that reject any retrieval that touches non‑compliant data sources.
Feedback loops – automatic logging of “think‑plan‑act” decisions to a central audit store for post‑mortem analysis.

The principle is that as LLM capabilities improve, the harness can be thinned; today the harness accounts for roughly 70 % of the codebase.
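
As an illustration of the first and third harness items (tool wrappers and validation checks), here is a small Python sketch. It assumes a hypothetical `Tool` class and a simple source allowlist standing in for the OPA policy check; the talk does not describe the actual wrapper interface, so every name here is illustrative.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet


class ToolInputError(ValueError):
    """Raised when a tool call fails validation before reaching the tool."""


@dataclass
class Tool:
    name: str
    run: Callable[..., Any]
    required: Dict[str, type]  # argument name -> expected type
    # Stand-in for a policy engine: empty means "no source restriction".
    allowed_sources: FrozenSet[str] = frozenset()

    def __call__(self, **kwargs: Any) -> Any:
        unknown = set(kwargs) - set(self.required)
        if unknown:
            raise ToolInputError(f"{self.name}: unexpected arguments {sorted(unknown)}")
        for arg, expected in self.required.items():
            if arg not in kwargs:
                raise ToolInputError(f"{self.name}: missing argument {arg!r}")
            if not isinstance(kwargs[arg], expected):
                raise ToolInputError(f"{self.name}: {arg!r} must be {expected.__name__}")
        source = kwargs.get("source")
        if self.allowed_sources and source not in self.allowed_sources:
            raise ToolInputError(f"{self.name}: source {source!r} is not permitted")
        return self.run(**kwargs)
```

A wrapper of this shape gives the agent one stable calling convention for every tool, so a rejected call surfaces as a typed error the loop can reflect on instead of an arbitrary downstream failure.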

3. Observability & compliance

Telemetry: Each loop emits OpenTelemetry spans with attributes loop.type, token.count, latency.ms, and error.code. This enables real‑time alerting on latency spikes or repeated token‑budget breaches.
Audit trails: All external HTTP calls (e.g., PubMed API) are logged with request IDs and masked payloads to satisfy GDPR and HIPAA audit requirements.
Safety nets: A “human‑in‑the‑loop” gate is enforced before the final report is signed off; the gate displays the process reflection summary and lets reviewers approve or request a re‑run.
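
To make the telemetry item concrete, the sketch below records one span-like dictionary per loop with the four attributes named above. It deliberately uses only the Python standard library as a stand-in for an OpenTelemetry tracer and exporter; `loop_span`, `SPANS`, and `budget_breaches` are hypothetical names, not part of the described system or of the OpenTelemetry API.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List

SPANS: List[Dict[str, object]] = []  # stand-in for a span exporter


@contextmanager
def loop_span(loop_type: str) -> Iterator[Dict[str, object]]:
    """Record loop.type, token.count, latency.ms and error.code for one loop."""
    span: Dict[str, object] = {
        "loop.type": loop_type,
        "token.count": 0,
        "latency.ms": 0.0,
        "error.code": None,
    }
    start = time.perf_counter()
    try:
        yield span
    except Exception as exc:
        span["error.code"] = type(exc).__name__
        raise
    finally:
        # Runs on success and on failure, so every loop leaves a record.
        span["latency.ms"] = (time.perf_counter() - start) * 1000.0
        SPANS.append(span)


def budget_breaches(spans: List[Dict[str, object]], token_budget: int) -> int:
    """Count spans whose token usage exceeded the budget (an alerting input)."""
    return sum(1 for s in spans if int(s["token.count"]) > token_budget)
```

A loop body would set `span["token.count"]` from the model response inside the `with loop_span("plan") as span:` block; an alert rule could then fire when `budget_breaches` grows across consecutive loops.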

4. Scaling strategy

Horizontal scaling: Deploy each agent instance as a Kubernetes pod behind a ClusterIP service. Autoscale based on Prometheus metric agent_loop_duration_seconds with a target of 2 s per step.
GPU off‑loading: The LLM calls are routed to a dedicated NVIDIA H100 node pool; the retrieval layer runs on CPU‑only nodes to keep cost low.
Cost model: Average monthly spend per concurrent research project is ≈ $3 500 (LLM API + GPU time). Bulk discounts are achievable by negotiating enterprise contracts with the LLM provider.
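
For the horizontal-scaling item, the arithmetic behind a metric-driven autoscaler can be sketched as follows. The function mirrors the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, desired = ceil(current × observed / target), with its default 10 % tolerance band. The replica bounds are assumptions made for the example, and the talk does not say which autoscaler or adapter the team actually uses.

```python
import math


def desired_replicas(
    current_replicas: int,
    observed_seconds: float,      # e.g. averaged agent_loop_duration_seconds
    target_seconds: float = 2.0,  # "a target of 2 s per step"
    tolerance: float = 0.1,       # HPA default: ignore deviations within 10 %
    min_replicas: int = 1,        # assumed lower bound
    max_replicas: int = 50,       # assumed upper bound
) -> int:
    ratio = observed_seconds / target_seconds
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

With four pods and an observed step duration of 3 s, the ratio is 1.5 and the function asks for six pods; at 2.1 s the deviation falls inside the tolerance band and the replica count is left alone.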

Takeaways for engineers building autonomous AI agents

Explicit loops beat “one‑shot” prompts – decomposing a complex task into think‑plan‑act‑reflect stages yields a 10 % increase in success rate for multi‑hop queries.
Token budgeting is a first‑class concern – treat token limits like memory limits; enforce them early to avoid context anxiety.
Harness engineering bridges the gap between raw model capability and production reliability; invest in reusable tool wrappers and policy checks.
Observability must be built into the agent core – without per‑step metrics you cannot detect drift or compliance violations until after a costly failure.
Human oversight remains essential in regulated domains; design a clear hand‑off point where the agent presents its process reflection for review.
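
The token-budgeting takeaway can be illustrated with a small guard that is consulted before each step. This is a sketch under assumptions: the `TokenBudget` class and its `check`/`record` methods are hypothetical, the per-step default reuses the 1.5 k figure from the failure-mode table, and how the projected usage is estimated is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    total: int            # budget for the whole request
    per_step: int = 1500  # mirrors the "> 1.5 k per step" symptom threshold
    used: int = 0

    def check(self, projected_step_tokens: int) -> str:
        """Return 'proceed' or 'replan' before a step is executed."""
        if projected_step_tokens > self.per_step:
            return "replan"  # a single step would be too expensive
        if self.used + projected_step_tokens > self.total:
            return "replan"  # the request as a whole would overrun
        return "proceed"

    def record(self, actual_tokens: int) -> None:
        """Account for tokens actually consumed by a completed step."""
        self.used += actual_tokens
```

Checking before the step, rather than after, is what makes this behave like a memory limit: the agent is pushed to re-plan while it still has room to produce a complete answer.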

By sharing these concrete specifications, benchmark results, and operational lessons, Kulkarni’s presentation provides a practical blueprint for teams looking to move beyond chat‑style assistants toward truly autonomous research agents that can operate at scale in high‑stakes environments.

For more details on the underlying frameworks, see the official LangChain documentation and the OpenTelemetry spec.

