Why routing LLM calls is harder than it looks (lessons from building ai-gateway)
#LLMs


Backend Reporter

Building an LLM routing layer revealed unexpected complexity in prompt classification, local embeddings, and cost-quality tradeoffs.

Most applications I've worked on treat LLMs in a very simple way: you pick a model, send every request to it, and hope for the best. At first, that works. But over time I kept running into the same problems: simple queries hitting expensive models, provider outages breaking entire flows, and no control over the cost-quality tradeoff.

So I started building a small LLM routing layer that sits in front of model calls and decides which model should handle each request. This turned out to be way more interesting (and harder) than I expected.

The core idea

Instead of this: app → single LLM → response

I wanted: app → router → (cheap model / reasoning model / fallback) → response

The router decides based on the prompt:

  • simple → cheaper/faster model
  • complex → reasoning model
  • failure → fallback provider
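That decision flow can be sketched in a few lines. Everything here is a simplified stand-in, not the gateway's real API: `classify` is a placeholder (the actual router uses embedding similarity), and the provider functions are just async calls that may throw.

```typescript
type Route = "cheap" | "reasoning";

// Placeholder classifier; the real gateway classifies via embeddings.
function classify(prompt: string): Route {
  return prompt.length < 80 ? "cheap" : "reasoning";
}

// A provider is modeled as an async function that may throw on outage.
type Provider = (prompt: string) => Promise<string>;

async function route(
  prompt: string,
  providers: { cheap: Provider; reasoning: Provider; fallback: Provider }
): Promise<string> {
  const primary =
    classify(prompt) === "cheap" ? providers.cheap : providers.reasoning;
  try {
    return await primary(prompt);
  } catch {
    // Provider failure: fall through to the fallback provider.
    return providers.fallback(prompt);
  }
}
```

The point is that the failure branch is part of the routing decision itself, not an afterthought bolted onto each call site.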

What I built

The system is a self-hostable gateway with:

  • multi-provider support (Groq, Gemini fallback)
  • intent-based routing (embedding similarity)
  • semantic caching to avoid repeated calls
  • health-aware failover across providers
  • multi-tenant API keys + quotas
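The health-aware failover piece is conceptually simple: try providers in priority order, and when one fails, put it in a cooldown so subsequent requests skip it. A minimal sketch (the cooldown value is an assumption, not the gateway's actual setting):

```typescript
interface ProviderState {
  name: string;
  call: (prompt: string) => Promise<string>;
  unhealthyUntil: number; // epoch ms; 0 means healthy
}

const COOLDOWN_MS = 30_000; // assumed cooldown window

// Try providers in priority order, skipping any still in cooldown.
async function callWithFailover(
  providers: ProviderState[],
  prompt: string
): Promise<string> {
  const now = Date.now();
  for (const p of providers) {
    if (p.unhealthyUntil > now) continue; // still cooling down, skip
    try {
      return await p.call(prompt);
    } catch {
      p.unhealthyUntil = now + COOLDOWN_MS; // mark unhealthy, try next
    }
  }
  throw new Error("all providers unavailable");
}
```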

For embeddings, I experimented with running a local BGE model via Transformers.js instead of using external APIs.

The hardest problem: routing decisions

This is where things get tricky. At first I used embedding similarity to classify prompts into categories like:

  • simple question
  • summarization
  • code/reasoning
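The classification itself is essentially nearest-centroid matching: one reference embedding per intent, and the prompt goes to whichever it's most similar to. A sketch with plain cosine similarity, using precomputed vectors in place of real BGE outputs:

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// One reference embedding per intent (in practice, an averaged embedding
// of example prompts, produced by the local embedding model).
function classifyIntent(
  promptEmb: number[],
  centroids: Record<string, number[]>
): { intent: string; score: number } {
  let best = { intent: "unknown", score: -Infinity };
  for (const [intent, emb] of Object.entries(centroids)) {
    const score = cosine(promptEmb, emb);
    if (score > best.score) best = { intent, score };
  }
  return best;
}
```

Note that this always returns *some* intent, even when the top two scores are nearly tied, which is exactly where the ambiguous-prompt problem below comes from.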

It works well for clear-cut cases. But ambiguous prompts break the classification.

Example: "Explain this system design in simple terms"

Is that:

  • summarization?
  • reasoning?
  • both?

This is where simple heuristics start to fall apart.

Local embeddings: great idea, annoying reality

Running embeddings locally felt like a big win: no external API, no rate limits, more control. But in practice:

  • cold start takes ~2–5 seconds (ONNX init)
  • memory overhead (~30–50MB even for small models)
  • scaling becomes tricky

Once the model is warm, performance is fine. But that first request penalty is very real, especially for user-facing systems.
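One way to keep that penalty off the request path is to start loading the model at process startup and have request handlers await a shared promise. A sketch, where `loadEmbeddingModel` is a stand-in for the actual Transformers.js pipeline setup:

```typescript
// Kick off model initialization once, at startup, instead of on first request.
let warmupPromise: Promise<(text: string) => number[]> | null = null;

function warmup(
  loadEmbeddingModel: () => Promise<(text: string) => number[]>
) {
  if (!warmupPromise) warmupPromise = loadEmbeddingModel();
  return warmupPromise;
}

// Request handlers await the shared promise; only startup pays the
// cold-start cost, and the model is loaded exactly once.
async function embed(text: string): Promise<number[]> {
  if (!warmupPromise) throw new Error("warmup() was not called at startup");
  const model = await warmupPromise;
  return model(text);
}
```

This doesn't eliminate the 2–5 second init, but it moves it to deploy time, where nobody is waiting on it.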

What actually worked

A few things that made a noticeable difference:

  • semantic caching → avoids recomputing embeddings and responses
  • fallback logic → makes the system much more reliable
  • cheap-first routing → try fast/cheap models, escalate if needed
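Of these, semantic caching is the easiest to sketch: store each prompt's embedding alongside its response, and on a new request, return the cached response if the embeddings are close enough. The 0.95 threshold here is an illustrative assumption, and a real version would use an index rather than a linear scan:

```typescript
interface CacheEntry {
  embedding: number[];
  response: string;
}

// Minimal semantic cache: a hit is any cached embedding within the
// similarity threshold of the incoming prompt's embedding.
class SemanticCache {
  private entries: CacheEntry[] = [];
  constructor(private threshold = 0.95) {} // assumed threshold

  lookup(embedding: number[]): string | null {
    for (const e of this.entries) {
      if (cosineSim(embedding, e.embedding) >= this.threshold) {
        return e.response;
      }
    }
    return null;
  }

  store(embedding: number[], response: string) {
    this.entries.push({ embedding, response });
  }
}

function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}
```

A cache hit skips both the embedding-to-LLM round trip and the model cost, which is why it moved the needle more than any routing tweak did.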

What didn't work (yet)

  • purely heuristic routing (not reliable enough)
  • static thresholds for classification
  • assuming "simple vs complex" is easy to define

Where I think this goes next

The obvious direction is moving toward learning-based routing:

  • track which responses get escalated
  • use retries/failures as signals
  • gradually learn which model performs best per prompt type

Instead of hardcoding rules, let the system adapt over time.
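A toy version of that feedback loop: keep a running success rate per (category, model) pair, fed by escalation and retry signals, and route each category to its best-performing model. This is my sketch of the idea, not anything the gateway ships today:

```typescript
// Per (category, model) running success rate, updated from
// escalation / retry outcomes.
class AdaptiveRouter {
  private stats = new Map<string, { ok: number; total: number }>();

  record(category: string, model: string, succeeded: boolean) {
    const key = `${category}:${model}`;
    const s = this.stats.get(key) ?? { ok: 0, total: 0 };
    s.total++;
    if (succeeded) s.ok++;
    this.stats.set(key, s);
  }

  // Pick the model with the best observed success rate for this category.
  // Unseen models get an optimistic prior so they still receive traffic.
  pick(category: string, models: string[]): string {
    let best = models[0];
    let bestRate = -1;
    for (const m of models) {
      const s = this.stats.get(`${category}:${m}`);
      const rate = s ? s.ok / s.total : 1; // optimistic prior
      if (rate > bestRate) {
        best = m;
        bestRate = rate;
      }
    }
    return best;
  }
}
```

A production version would want exploration and decay (this is a bandit problem in disguise), but even the naive counter captures the core idea: escalations are free training data.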

Biggest takeaway

Building around LLMs isn't just about prompts. It's about:

  • cost control
  • reliability
  • system design

The model is just one part of the system.

Curious to hear from others

If you've worked on something similar:

  • How are you deciding which model to use?
  • Are you running embeddings locally or using APIs?
  • Have you tried any learning-based routing approaches?

Would love to hear how others are tackling this.
