Building an LLM routing layer revealed unexpected complexity in prompt classification, local embeddings, and cost-quality tradeoffs.
Most applications I've worked on treat LLMs in a very simple way: you pick a model, send every request to it, and hope for the best. At first, that works. But over time I kept running into the same problems: simple queries hitting expensive models, provider outages breaking entire flows, and no control over cost vs quality tradeoffs.
So I started building a small LLM routing layer that sits in front of model calls and decides which model should handle each request. This turned out to be way more interesting (and harder) than I expected.
The core idea
Instead of this: app → single LLM → response
I wanted: app → router → (cheap model / reasoning model / fallback) → response
The router decides based on the prompt:
- simple → cheaper/faster model
- complex → reasoning model
- failure → fallback provider
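In code, the top of the router is just a dispatch table keyed on a classified intent, with a fallback route when the primary provider is unhealthy. A minimal TypeScript sketch (the intent names and model IDs are placeholders, not my actual config):

```typescript
type Intent = "simple" | "complex" | "unknown";

interface Route {
  provider: string;
  model: string;
}

// Hypothetical model table; the real IDs depend on your providers.
const ROUTES: Record<Intent, Route> = {
  simple: { provider: "groq", model: "small-fast-model" },
  complex: { provider: "groq", model: "large-reasoning-model" },
  unknown: { provider: "groq", model: "large-reasoning-model" }, // when in doubt, escalate
};

const FALLBACK: Route = { provider: "gemini", model: "fallback-model" };

// Pick a route for an intent; health checks decide whether the
// primary provider is usable or we fail over.
function pickRoute(intent: Intent, primaryHealthy: boolean): Route {
  return primaryHealthy ? ROUTES[intent] : FALLBACK;
}
```

The interesting part is everything feeding into that `Intent` value, which is what the rest of this post is about.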
What I built
The system is a self-hostable gateway with:
- multi-provider support (Groq, with Gemini as fallback)
- intent-based routing (embedding similarity)
- semantic caching to avoid repeated calls
- health-aware failover across providers
- multi-tenant API keys + quotas
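The semantic cache is conceptually simple: embed the prompt, and if a previously seen prompt is within a cosine-similarity threshold, reuse its response. A stripped-down in-memory sketch (the threshold value is an assumption and needs tuning per embedding model):

```typescript
interface CacheEntry {
  embedding: number[];
  response: string;
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

class SemanticCache {
  private entries: CacheEntry[] = [];
  constructor(private threshold = 0.92) {} // assumed cutoff; tune it

  // Linear scan is fine for small caches; swap in an ANN index at scale.
  get(embedding: number[]): string | undefined {
    let best: CacheEntry | undefined;
    let bestSim = this.threshold;
    for (const e of this.entries) {
      const sim = cosine(embedding, e.embedding);
      if (sim >= bestSim) { bestSim = sim; best = e; }
    }
    return best?.response;
  }

  set(embedding: number[], response: string): void {
    this.entries.push({ embedding, response });
  }
}
```

A near-duplicate prompt skips both the embedding-downstream work and the model call entirely, which is where most of the cost savings came from.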
For embeddings, I experimented with running a local BGE model via Transformers.js instead of using external APIs.
The hardest problem: routing decisions
This is where things get tricky. At first I used embedding similarity to classify prompts into categories like:
- simple question
- summarization
- code/reasoning
It works well for clear cases, but ambiguous prompts break the classification.
Example: "Explain this system design in simple terms"
Is that:
- summarization?
- reasoning?
- both?
This is where simple heuristics start to fall apart.
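For reference, the clear-case version is nearest-centroid classification: embed a few example prompts per category, average them into a centroid, and assign new prompts to the closest one. A sketch (embeddings are stubbed as plain vectors here; in the real system they come from the BGE model). Returning the score alongside the category at least lets you flag low-confidence results as ambiguous instead of silently trusting them:

```typescript
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Average a list of example embeddings into one centroid per category.
function centroid(examples: number[][]): number[] {
  const dim = examples[0].length;
  const c = new Array(dim).fill(0);
  for (const e of examples) for (let i = 0; i < dim; i++) c[i] += e[i] / examples.length;
  return c;
}

// Best category plus its similarity score, so callers can treat
// low scores as "ambiguous" rather than committing to a route.
function classify(
  prompt: number[],
  centroids: Record<string, number[]>,
): { category: string; score: number } {
  let best = { category: "unknown", score: -Infinity };
  for (const [category, c] of Object.entries(centroids)) {
    const score = cosine(prompt, c);
    if (score > best.score) best = { category, score };
  }
  return best;
}
```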
Local embeddings: great idea, annoying reality
Running embeddings locally felt like a big win: no external API, no rate limits, more control. But in practice:
- cold start takes ~2–5 seconds (ONNX init)
- memory overhead (~30–50MB even for small models)
- scaling becomes tricky
Once the model is warm, performance is fine. But that first request penalty is very real, especially for user-facing systems.
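The mitigation I settled on is eager warm-up: kick off model initialization when the process boots, not on the first user request, and memoize the init so concurrent callers share it. The shape of it, with the actual Transformers.js/ONNX load stubbed behind a loader function (the loader and its signature are placeholders for this sketch):

```typescript
// Stand-in for the real embedder init, e.g. loading ONNX weights.
// That init call is what costs the ~2-5 seconds.
type Embedder = (text: string) => number[];

let embedderPromise: Promise<Embedder> | null = null;

// Memoize the init so concurrent callers share one load instead of racing.
function loadEmbedder(load: () => Promise<Embedder>): Promise<Embedder> {
  if (!embedderPromise) embedderPromise = load();
  return embedderPromise;
}

// Call this at startup so the first user request finds a warm model.
async function warmUp(load: () => Promise<Embedder>): Promise<void> {
  const embed = await loadEmbedder(load);
  embed("warm-up"); // touch the model once to trigger any lazy init
}
```

This doesn't make the cold start cheaper, it just moves it to deploy time, where nobody is waiting on it.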
What actually worked
A few things that made a noticeable difference:
- semantic caching → avoids recomputing embeddings and responses
- fallback logic → makes the system much more reliable
- cheap-first routing → try fast/cheap models, escalate if needed
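Cheap-first routing is worth spelling out: call the cheapest model, run the result through an acceptance check, and only escalate on failure, with provider errors falling through to the next tier. A minimal sketch (the acceptance check here is a placeholder; the real signal could be a refusal pattern, a length check, or a verifier model):

```typescript
type Caller = (prompt: string) => Promise<string>;

// Try models from cheapest to most capable; return the first
// response that passes the acceptance check.
async function cheapFirst(
  prompt: string,
  tiers: Caller[],
  acceptable: (response: string) => boolean,
): Promise<string> {
  let last = "";
  for (const call of tiers) {
    try {
      last = await call(prompt);
      if (acceptable(last)) return last;
    } catch {
      // Provider error: fall through to the next tier (failover).
    }
  }
  return last; // best effort if every tier disappoints
}
```

The catch is that "acceptable" is doing a lot of work in that signature, which loops right back to the classification problem above.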
What didn't work (yet)
- purely heuristic routing (not reliable enough)
- static thresholds for classification
- assuming "simple vs complex" is easy to define
Where I think this goes next
The obvious direction is moving toward learning-based routing:
- track which responses get escalated
- use retries/failures as signals
- gradually learn which model performs best per prompt type
Instead of hardcoding rules, let the system adapt over time.
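One way to make that concrete without real ML infrastructure is a per-category bandit: mostly pick the model with the best observed success rate for that prompt type, occasionally explore, and feed escalations/retries back as failures. An epsilon-greedy sketch (the epsilon value and the success signal are assumptions, not something I've validated yet):

```typescript
interface Arm { successes: number; trials: number; }

class RoutingBandit {
  private arms = new Map<string, Arm>(); // keyed by `${category}:${model}`

  constructor(private models: string[], private epsilon = 0.1) {}

  private arm(category: string, model: string): Arm {
    const key = `${category}:${model}`;
    let a = this.arms.get(key);
    if (!a) { a = { successes: 0, trials: 0 }; this.arms.set(key, a); }
    return a;
  }

  // Explore with probability epsilon, otherwise exploit the best
  // observed success rate for this prompt category.
  pick(category: string): string {
    if (Math.random() < this.epsilon) {
      return this.models[Math.floor(Math.random() * this.models.length)];
    }
    let best = this.models[0], bestRate = -1;
    for (const m of this.models) {
      const a = this.arm(category, m);
      const rate = a.trials === 0 ? 1 : a.successes / a.trials; // optimism for untried arms
      if (rate > bestRate) { bestRate = rate; best = m; }
    }
    return best;
  }

  // Outcome feedback: success = response accepted,
  // failure = the request was escalated or retried.
  record(category: string, model: string, success: boolean): void {
    const a = this.arm(category, model);
    a.trials++;
    if (success) a.successes++;
  }
}
```

The nice property is that the escalation data the router already produces doubles as the training signal, so there's no separate labeling step.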
Biggest takeaway
Building around LLMs isn't just about prompts. It's about:
- cost control
- reliability
- system design
The model is just one part of the system.
Curious to hear from others
If you've worked on something similar:
- How are you deciding which model to use?
- Are you running embeddings locally or using APIs?
- Have you tried any learning-based routing approaches?
Would love to hear how others are tackling this.

