For years, text-to-SQL promised to democratize data access by translating natural language into database queries. Yet in production, teams at companies like Findly discovered a harsh truth: real business questions rarely map to single SQL statements. Questions like "Compare these cohorts over the last four releases and explain the variance" or "Create a trading strategy" demand multi-step workflows combining data retrieval, transformation, validation, and narrative explanation—far beyond what vanilla text-to-SQL delivers.

Why Text-to-SQL Alone Fails

Real-world analytics requires orchestration:
1. Planning: Decomposing problems into steps
2. Execution: Targeted SQL + Python for transformations
3. Validation: Sanity checks and data quality guards
4. Narrative: Explaining results with drill-down paths

Without this scaffolding, systems generate brittle queries that crumble on ambiguous joins, complex logic, or missing context. The solution? Treat text-to-SQL as one tool in an AI analyst’s toolbox—not the entire product.
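
To make the planning stage concrete, here is a minimal sketch (in Python) of how such a question might decompose into typed steps. The Step dataclass, tool names, and step goals are illustrative assumptions, not Findly's internal schema:

# Hypothetical plan decomposition for "Compare these cohorts over the
# last four releases and explain the variance"
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str                  # "sql", "python", "validate", or "narrate"
    goal: str                  # what the step must produce
    depends_on: list[str] = field(default_factory=list)

plan = [
    Step("sql", "fetch per-cohort metrics for the last four releases"),
    Step("python", "compute variance across releases per cohort",
         depends_on=["fetch per-cohort metrics for the last four releases"]),
    Step("validate", "check row counts and null rates before analysis"),
    Step("narrate", "explain the variance with drill-down paths"),
]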

The Semantic Layer: Your Business Logic Compiler

Findly’s breakthrough came from encoding business meaning explicitly via a semantic layer. Using Malloy—an open-source modeling language—they defined:
- Dimensions (e.g., region, order_date)
- Measures (e.g., total_revenue is sum(price * quantity))
- Relationships (e.g., orders → customers)
- Metadata (descriptions, units, constraints)

-- Malloy model sketch; assumes a duckdb connection and a customers source
source: orders is duckdb.table('orders') extend {
  measure: total_revenue is sum(price * quantity)
  dimension: order_date is order_timestamp::date
  join_one: customers with customer_id
}

This acts as a "business logic compiler" that:
- Shrinks hallucination risk by restricting LLMs to validated concepts
- Enables pre-execution checks (e.g., catching invalid field references); a sketch follows this list
- Grounds code generation in single-source-of-truth definitions
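
As a sketch of such a pre-execution check, the snippet below validates field references from generated code against the semantic model before anything runs. The SEMANTIC_MODEL dictionary and validate_fields helper are assumptions for illustration:

# Hypothetical pre-execution check against the semantic model
SEMANTIC_MODEL = {
    "orders": {"dimensions": {"region", "order_date"},
               "measures": {"total_revenue"}},
}

def validate_fields(source: str, fields: set[str]) -> list[str]:
    """Return any referenced fields the semantic model does not define."""
    model = SEMANTIC_MODEL.get(source, {})
    known = model.get("dimensions", set()) | model.get("measures", set())
    return sorted(fields - known)

# e.g., an LLM hallucinating "revenue" instead of "total_revenue":
assert validate_fields("orders", {"region", "revenue"}) == ["revenue"]

Failing fast here turns a would-be hallucinated query into an actionable error before it ever touches the warehouse.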

Multi-Agent Workflows: The Orchestration Engine

Complex questions trigger a squad of specialized AI agents:
1. Planner: Breaks down the problem
2. Retriever: Fetches context via precision RAG
3. Coder: Generates SQL/Python
4. Validator: Runs sanity tests
5. Narrator: Explains results

Multi-agent planning isolates failures and maintains accountability across steps.
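
A minimal sketch of this orchestration, with each agent as a callable and the validator gating retries. The function signatures and retry policy are assumptions, not Findly's implementation:

# Hypothetical orchestration loop; each agent is a callable supplied by the host app
from typing import Callable

def run_analysis(question: str,
                 planner: Callable, retriever: Callable, coder: Callable,
                 execute: Callable, validator: Callable, narrator: Callable,
                 max_retries: int = 2) -> str:
    plan = planner(question)                    # 1. decompose into steps
    context = retriever(question, plan)         # 2. precision RAG (list of snippets)
    for _ in range(max_retries + 1):
        code = coder(plan, context)             # 3. generate SQL/Python
        result = execute(code)                  #    sandboxed execution
        issues = validator(result, plan)        # 4. sanity tests
        if not issues:
            return narrator(result, plan)       # 5. explain with drill-downs
        context = context + [f"validator feedback: {issues}"]  # retry with feedback
    raise RuntimeError("validation failed after retries")

Keeping each stage behind its own interface is what makes failures attributable: a bad join surfaces at the validator, not in the final narrative.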

Retrieval: Recommendation Systems for LLMs

Treating retrieval as a recommendation pipeline was pivotal. Findly layers three retrieval stages, sketched below:
1. Keyword search: For exact acronyms/terms
2. Embeddings: For semantic matches
3. Fine-tuned rerankers: Instruction-optimized models (e.g., Voyage AI’s rerank-2.5) that boost relevance by 8%+
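
A minimal sketch of that three-stage pipeline, with the reranker kept behind a generic callable so no particular API is implied; the recall/precision split and the dedup strategy are illustrative assumptions:

# Hypothetical hybrid retrieval: keyword + embedding recall, then rerank
from typing import Callable

def retrieve(query: str,
             keyword_search: Callable[[str], list[str]],
             embed_search: Callable[[str], list[str]],
             rerank: Callable[[str, list[str]], list[str]],
             k: int = 5) -> list[str]:
    # Recall stage: exact keyword hits (acronyms, field names) unioned with
    # semantic neighbors; dict.fromkeys dedups while preserving order.
    candidates = list(dict.fromkeys(keyword_search(query) + embed_search(query)))
    # Precision stage: an instruction-tuned reranker orders candidates
    # against the full task description (e.g., "filter to Q3 metrics").
    return rerank(query, candidates)[:k]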

Instruction-following rerankers understand task-specific context like "filter to Q3 metrics"—critical for precision.

Python’s Strategic Role

SQL alone can’t handle:
- Statistical tests
- Time-series transforms
- Custom calculations

Sandboxed Python execution enables:

# Example: currency conversion using semantic-layer tags
# (get_exchange_rate is a pre-built, sandboxed helper)
df['revenue_eur'] = df['total_revenue_usd'] * get_exchange_rate('USD', 'EUR')

By composing pre-built functions (e.g., convert_currency()), LLMs generate concise, maintainable code.
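
A sketch of what one such pre-built helper might look like; convert_currency, its signature, and the static rate table are illustrative, not Findly's actual library:

# Hypothetical sandboxed helper the LLM composes instead of re-deriving logic
import pandas as pd

RATES = {("USD", "EUR"): 0.92}  # illustrative static table; a real helper would call a rates service

def convert_currency(df: pd.DataFrame, col: str, src: str, dst: str) -> pd.DataFrame:
    """Append a converted column, leaving the source column untouched."""
    out = df.copy()
    target = col.replace(f"_{src.lower()}", f"_{dst.lower()}")  # total_revenue_usd -> total_revenue_eur
    out[target] = out[col] * RATES[(src, dst)]
    return out

df = pd.DataFrame({"total_revenue_usd": [1200.0, 950.0]})
df = convert_currency(df, "total_revenue_usd", "USD", "EUR")

One vetted helper means one place to fix rate handling, rather than a new ad-hoc formula in every generated script.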

Failure Modes & Fixes

Key production lessons:
- Ambiguous joins? Declare grain and join paths explicitly in the semantic layer
- Slow responses? Route simple queries to faster models (e.g., o4-mini) and complex ones to Gemini 2.5 Pro; a routing sketch follows this list
- Silent errors? Add validation hooks and require citations in answers
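
As referenced above, a routing sketch; the complexity heuristic and the model identifiers as config strings are assumptions for illustration:

# Hypothetical model router: cheap/fast for simple asks, reasoning model otherwise
FAST_MODEL = "o4-mini"
REASONING_MODEL = "gemini-2.5-pro"

def pick_model(plan_steps: int, needs_python: bool) -> str:
    # Single-step, SQL-only questions rarely need deep reasoning.
    if plan_steps <= 1 and not needs_python:
        return FAST_MODEL
    return REASONING_MODEL

assert pick_model(1, False) == FAST_MODEL
assert pick_model(4, True) == REASONING_MODEL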

The Road to Adaptive Analytics

The future lies in systems that:
- Dynamically switch between fast/reasoning LLMs
- Autonomously critique and refine outputs
- Continuously harvest business logic from user interactions

As Pedro Nascimento notes: "Context isn’t just data—it’s the product." It is architectural design, not prompt engineering, that moves AI analysts from demos to daily drivers.


Source: Lessons on building an AI data analyst by Pedro Nascimento (ex-Google/Twitter ML lead, Findly co-founder).