Why routing LLM calls is harder than it looks (lessons from building ai-gateway)
#LLMs


Backend Reporter

Building an LLM routing layer revealed unexpected complexity in prompt classification, local embeddings, and cost-quality tradeoffs.

Most applications I've worked on treat LLMs in a very simple way: you pick a model, send every request to it, and hope for the best. At first, that works. But over time I kept running into the same problems: simple queries hitting expensive models, provider outages breaking entire flows, and no control over the cost-quality tradeoff.

So I started building a small LLM routing layer that sits in front of model calls and decides which model should handle each request. This turned out to be way more interesting (and harder) than I expected.

The core idea

Instead of this: app → single LLM → response

I wanted: app → router → (cheap model / reasoning model / fallback) → response

The router decides based on the prompt:

  • simple → cheaper/faster model
  • complex → reasoning model
  • failure → fallback provider
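That decision flow can be sketched in a few lines. Everything here is a simplified stand-in, not the gateway's real API: `classify` is a placeholder (the actual router uses embedding similarity), and the provider functions are just async calls that may throw.

```typescript
type Route = "cheap" | "reasoning";

// Placeholder classifier; the real gateway classifies via embeddings.
function classify(prompt: string): Route {
  return prompt.length < 80 ? "cheap" : "reasoning";
}

// A provider is modeled as an async function that may throw on outage.
type Provider = (prompt: string) => Promise<string>;

async function route(
  prompt: string,
  providers: { cheap: Provider; reasoning: Provider; fallback: Provider }
): Promise<string> {
  const primary =
    classify(prompt) === "cheap" ? providers.cheap : providers.reasoning;
  try {
    return await primary(prompt);
  } catch {
    // Provider failure: fall through to the fallback provider.
    return providers.fallback(prompt);
  }
}
```

The point is that the failure branch is part of the routing decision itself, not an afterthought bolted onto each call site.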

What I built

The system is a self-hostable gateway with:

  • multi-provider support (Groq, Gemini fallback)
  • intent-based routing (embedding similarity)
  • semantic caching to avoid repeated calls
  • health-aware failover across providers
  • multi-tenant API keys + quotas
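The health-aware failover piece is conceptually simple: try providers in priority order, and when one fails, put it in a cooldown so subsequent requests skip it. A minimal sketch (the cooldown value is an assumption, not the gateway's actual setting):

```typescript
interface ProviderState {
  name: string;
  call: (prompt: string) => Promise<string>;
  unhealthyUntil: number; // epoch ms; 0 means healthy
}

const COOLDOWN_MS = 30_000; // assumed cooldown window

// Try providers in priority order, skipping any still in cooldown.
async function callWithFailover(
  providers: ProviderState[],
  prompt: string
): Promise<string> {
  const now = Date.now();
  for (const p of providers) {
    if (p.unhealthyUntil > now) continue; // still cooling down, skip
    try {
      return await p.call(prompt);
    } catch {
      p.unhealthyUntil = now + COOLDOWN_MS; // mark unhealthy, try next
    }
  }
  throw new Error("all providers unavailable");
}
```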

For embeddings, I experimented with running a local BGE model via Transformers.js instead of using external APIs.

The hardest problem: routing decisions

This is where things get tricky. At first I used embedding similarity to classify prompts into categories like:

  • simple question
  • summarization
  • code/reasoning
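The classification itself is essentially nearest-centroid matching: one reference embedding per intent, and the prompt goes to whichever it's most similar to. A sketch with plain cosine similarity, using precomputed vectors in place of real BGE outputs:

```typescript
// Cosine similarity between two embedding vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// One reference embedding per intent (in practice, an averaged embedding
// of example prompts, produced by the local embedding model).
function classifyIntent(
  promptEmb: number[],
  centroids: Record<string, number[]>
): { intent: string; score: number } {
  let best = { intent: "unknown", score: -Infinity };
  for (const [intent, emb] of Object.entries(centroids)) {
    const score = cosine(promptEmb, emb);
    if (score > best.score) best = { intent, score };
  }
  return best;
}
```

Note that this always returns *some* intent, even when the top two scores are nearly tied, which is exactly where the ambiguous-prompt problem below comes from.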

It works well for clear-cut cases. But ambiguous prompts break the classification.

Example: "Explain this system design in simple terms"

Is that:

  • summarization?
  • reasoning?
  • both?

This is where simple heuristics start to fall apart.

Local embeddings: great idea, annoying reality

Running embeddings locally felt like a big win: no external API, no rate limits, more control. But in practice:

  • cold start takes ~2–5 seconds (ONNX init)
  • memory overhead (~30–50MB even for small models)
  • scaling becomes tricky

Once the model is warm, performance is fine. But that first request penalty is very real, especially for user-facing systems.
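One way to keep that penalty off the request path is to start loading the model at process startup and have request handlers await a shared promise. A sketch, where `loadEmbeddingModel` is a stand-in for the actual Transformers.js pipeline setup:

```typescript
// Kick off model initialization once, at startup, instead of on first request.
let warmupPromise: Promise<(text: string) => number[]> | null = null;

function warmup(
  loadEmbeddingModel: () => Promise<(text: string) => number[]>
) {
  if (!warmupPromise) warmupPromise = loadEmbeddingModel();
  return warmupPromise;
}

// Request handlers await the shared promise; only startup pays the
// cold-start cost, and the model is loaded exactly once.
async function embed(text: string): Promise<number[]> {
  if (!warmupPromise) throw new Error("warmup() was not called at startup");
  const model = await warmupPromise;
  return model(text);
}
```

This doesn't eliminate the 2–5 second init, but it moves it to deploy time, where nobody is waiting on it.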

What actually worked

A few things that made a noticeable difference:

  • semantic caching → avoids recomputing embeddings and responses
  • fallback logic → makes the system much more reliable
  • cheap-first routing → try fast/cheap models, escalate if needed
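Of these, semantic caching is the easiest to sketch: store each prompt's embedding alongside its response, and on a new request, return the cached response if the embeddings are close enough. The 0.95 threshold here is an illustrative assumption, and a real version would use an index rather than a linear scan:

```typescript
interface CacheEntry {
  embedding: number[];
  response: string;
}

// Minimal semantic cache: a hit is any cached embedding within the
// similarity threshold of the incoming prompt's embedding.
class SemanticCache {
  private entries: CacheEntry[] = [];
  constructor(private threshold = 0.95) {} // assumed threshold

  lookup(embedding: number[]): string | null {
    for (const e of this.entries) {
      if (cosineSim(embedding, e.embedding) >= this.threshold) {
        return e.response;
      }
    }
    return null;
  }

  store(embedding: number[], response: string) {
    this.entries.push({ embedding, response });
  }
}

function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}
```

A cache hit skips both the embedding-to-LLM round trip and the model cost, which is why it moved the needle more than any routing tweak did.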

What didn't work (yet)

  • purely heuristic routing (not reliable enough)
  • static thresholds for classification
  • assuming "simple vs complex" is easy to define

Where I think this goes next

The obvious direction is moving toward learning-based routing:

  • track which responses get escalated
  • use retries/failures as signals
  • gradually learn which model performs best per prompt type

Instead of hardcoding rules, let the system adapt over time.
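A toy version of that feedback loop: keep a running success rate per (category, model) pair, fed by escalation and retry signals, and route each category to its best-performing model. This is my sketch of the idea, not anything the gateway ships today:

```typescript
// Per (category, model) running success rate, updated from
// escalation / retry outcomes.
class AdaptiveRouter {
  private stats = new Map<string, { ok: number; total: number }>();

  record(category: string, model: string, succeeded: boolean) {
    const key = `${category}:${model}`;
    const s = this.stats.get(key) ?? { ok: 0, total: 0 };
    s.total++;
    if (succeeded) s.ok++;
    this.stats.set(key, s);
  }

  // Pick the model with the best observed success rate for this category.
  // Unseen models get an optimistic prior so they still receive traffic.
  pick(category: string, models: string[]): string {
    let best = models[0];
    let bestRate = -1;
    for (const m of models) {
      const s = this.stats.get(`${category}:${m}`);
      const rate = s ? s.ok / s.total : 1; // optimistic prior
      if (rate > bestRate) {
        best = m;
        bestRate = rate;
      }
    }
    return best;
  }
}
```

A production version would want exploration and decay (this is a bandit problem in disguise), but even the naive counter captures the core idea: escalations are free training data.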

Biggest takeaway

Building around LLMs isn't just about prompts. It's about:

  • cost control
  • reliability
  • system design

The model is just one part of the system.

Curious to hear from others

If you've worked on something similar:

  • How are you deciding which model to use?
  • Are you running embeddings locally or using APIs?
  • Have you tried any learning-based routing approaches?

Would love to hear how others are tackling this.
