How Semantic Routers Cut Claude Code Skill Tokens by 456×

Startups Reporter

An empirical test of the “skills as semantic router” pattern shows that indexing a community skill catalog into a vector store cuts the skill-related part of Claude Code’s prompt from ~228 K tokens to under 2 K per task, while still returning a usable skill in the top‑5 for 87.5 % of queries (7 of 8).

Introduction

When Claude Code agents load a skill catalog, the default approach puts every skill’s name and description into the system prompt. For a community collection of 4 556 skills that means roughly 228 000 tokens, more than the entire 200 K token window of Claude Sonnet. The result is either a prompt that won’t fit or, for smaller catalogs that do fit, a long list that dilutes the model’s attention and leads to mis‑picks.

The semantic router pattern separates the catalog from the prompt: each skill is stored once in an embedding index, and at task time the agent performs a single vector search to retrieve the most relevant candidates. Dmytro Klymentiev ran a controlled experiment to see whether this approach actually works in practice.

Why Progressive Disclosure Falls Short

Anthropic’s “progressive disclosure” defers loading skill bodies until they are needed, but the name and description of every skill are still read at startup. That token cost grows roughly linearly with the size of the catalog:

| Skills | Approx. tokens (names + descriptions) | Share of 200 K window |
| --- | --- | --- |
| 100 | ~2.5 K | 1.3 % |
| 1 000 | ~50 K | 25 % |
| 4 000 | ~200 K | 100 % (overflow) |

Beyond the raw token budget, a long list of similar items reduces the model’s ability to focus, and there is no built‑in garbage collection – stale or duplicate skills accumulate indefinitely.
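The growth in catalog cost can be checked with a few lines of arithmetic. The sketch below assumes a flat average of about 50 tokens per name‑plus‑description entry, which is the figure implied by 4 556 skills ≈ 228 K tokens. The constant and the function name are illustrative, not measured values.

```python
# Assumption: ~50 tokens per name+description entry, implied by
# 4 556 skills ~= 228 K tokens. Real entries vary in length.
TOKENS_PER_ENTRY = 50
CONTEXT_WINDOW = 200_000  # the 200 K window used in the article


def catalog_cost(num_skills: int) -> tuple[int, float]:
    """Return (estimated tokens, share of the context window in percent)."""
    tokens = num_skills * TOKENS_PER_ENTRY
    share = 100.0 * tokens / CONTEXT_WINDOW
    return tokens, share


for n in (1_000, 4_000, 4_556):
    tokens, share = catalog_cost(n)
    print(f"{n:>5} skills -> ~{tokens:,} tokens ({share:.1f}% of window)")
```

Under this assumption the full 4 556‑skill catalog lands at roughly 114 % of the window, so it cannot be loaded eagerly at all.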

The Semantic Router Experiment

Corpus & Indexing

  • Source: antigravity-awesome‑skills, a public repository of Anthropic‑format skill markdown files.
  • Size: 4 556 SKILL.md files, deduplicated by directory.
  • Indexed subset: After bulk‑ingest failures, 686 skills were successfully embedded.
  • Embedding model: intfloat/multilingual-e5-base via sentence‑transformers (768‑dimensional vectors). Stored in PostgreSQL with the pgvector extension.
  • Index payload: name + "\n\n" + description (≈50‑200 tokens per entry).
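A minimal sketch of how the index payload could be built from one skill file is shown below. It assumes each SKILL.md starts with a front‑matter block delimited by `---` lines and containing single‑line `name:` and `description:` fields. The helper names are hypothetical, and a real pipeline would more likely use a proper YAML parser.

```python
from typing import Optional


def parse_front_matter(text: str) -> dict[str, str]:
    """Read simple `key: value` pairs from a leading '---' block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":  # end of front matter
            break
        key, sep, value = line.partition(":")
        if sep:
            # Drop surrounding whitespace and optional quotes.
            fields[key.strip()] = value.strip().strip("\"'")
    return fields


def build_payload(skill_md: str) -> Optional[str]:
    """Return name + blank line + description, or None if either is missing."""
    fields = parse_front_matter(skill_md)
    name = fields.get("name", "")
    description = fields.get("description", "")
    if not name or not description:
        return None  # empty or malformed file: skip instead of embedding it
    return f"{name}\n\n{description}"
```

Returning `None` for files without a usable name or description makes skipped documents explicit, so they can be counted rather than silently dropped during ingest.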

Query Set

Eight task descriptions were crafted before looking at the corpus, covering common developer workflows:

  1. deploy docker production
  2. analyze stock market data
  3. write marketing email
  4. optimize slow SQL query
  5. security audit web app
  6. set up CI/CD pipeline python
  7. debug memory leak C++
  8. build React TypeScript component

For each query the router returned the top‑5 most similar skills (cosine similarity) and the results were manually judged.
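The ranking step itself is small. In the experiment the vectors come from the embedding model and the search runs inside PostgreSQL with pgvector. The pure‑Python sketch below only illustrates cosine‑similarity top‑k on toy three‑dimensional vectors, and the skill names and vectors in it are made up for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 5):
    """Return the k (name, similarity) pairs most similar to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: hypothetical skills with hand-written vectors.
toy_index = {
    "skill-a": [1.0, 0.0, 0.0],
    "skill-b": [0.9, 0.1, 0.0],
    "skill-c": [0.0, 1.0, 0.0],
}
for name, score in top_k([1.0, 0.0, 0.0], toy_index, k=2):
    print(f"{name}: {score:.2f}")
```

A linear scan like this is adequate for a few hundred entries; a vector index only becomes necessary as the catalog grows.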

Metrics

| Metric | Definition |
| --- | --- |
| Strict top‑1 | First result is the exact skill a human would pick. |
| Loose top‑1 | First result is in the right family but not a perfect match. |
| Top‑5 cluster | At least one of the five results is a strong match the agent could use. |

Results

| Query | Top‑1 result (skill, similarity) | Top‑5 verdict |
| --- | --- | --- |
| deploy docker production | azd-deployment (0.86) | YES (3/5 are deploy‑related) |
| analyze stock market data | xvary-stock-research (0.87) | YES (relevant at #4) |
| write marketing email | copywriting (0.86) | YES (blog‑writing, writer) |
| optimize slow SQL query | food-database-query (0.85) | NO (no true SQL‑tuning skill) |
| security audit web app | laravel-security-audit (0.88) | YES (aws‑security, burp‑suite, etc.) |
| set up CI/CD pipeline python | gitlab-ci-patterns (0.87) | YES (circleci‑automation at #2) |
| debug memory leak C++ | c-pro (0.86) | YES (gdb‑cli, systematic‑debugging) |
| build React TypeScript component | react-flow-node-ts (0.88) | YES (5/5 frontend‑relevant) |

Strict top‑1 accuracy: 5 / 8 = 62.5 %

Top‑5 cluster accuracy: 7 / 8 = 87.5 %
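The top‑5 figure can be recomputed directly from the verdict column of the results table. The sketch below does that and also shows how coarse the metric is with only eight queries. The function name is illustrative. Per‑query strict top‑1 judgments are not listed in the article, so only the top‑5 verdicts are used.

```python
# Top-5 verdicts copied from the results table (True = YES, False = NO).
top5_verdicts = {
    "deploy docker production": True,
    "analyze stock market data": True,
    "write marketing email": True,
    "optimize slow SQL query": False,
    "security audit web app": True,
    "set up CI/CD pipeline python": True,
    "debug memory leak C++": True,
    "build React TypeScript component": True,
}


def hit_rate(verdicts: list[bool]) -> float:
    """Fraction of queries judged a hit."""
    if not verdicts:
        raise ValueError("need at least one verdict")
    return sum(verdicts) / len(verdicts)


top5 = hit_rate(list(top5_verdicts.values()))
step = 100 / len(top5_verdicts)  # how far one query moves the metric
print(f"top-5 cluster accuracy: {top5:.1%}")
print(f"one query moves the metric by {step:.1f} percentage points")
```

With eight queries, every single judgment shifts either metric by 12.5 percentage points, which is worth keeping in mind when comparing the two accuracy figures.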

Latency: sub‑second per query on a single‑CPU container.

Scaling Curve

The test was repeated after each batch of 100 newly indexed skills. The top‑5 cluster metric plateaued at around 85 % once roughly 500 skills were indexed, while strict top‑1 kept climbing and reached 62.5 % only with the final batch (686 skills). This suggests that the router finds a relevant family of skills early; additional indexing mainly improves the chance of hitting the exact desired skill.


Token Savings

| Approach | Tokens per turn | % of 200 K window |
| --- | --- | --- |
| Default loading (4 556 skills) | ~228 K | >100 % (won’t fit) |
| Semantic router (top‑5 + body) | < 2 K | ~1 % |

The router therefore delivers a ~456× reduction in context usage for the same task, leaving about 99 % of the window free for actual work.
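The two figures in this section can be related with simple division. The sketch below uses the ~228 K baseline from the table; the function names are illustrative. It shows that the "< 2 K" upper bound guarantees at least a ~114× reduction, while the headline 456× corresponds to about 500 tokens per turn. The article does not state the exact per‑turn count.

```python
BASELINE_TOKENS = 228_000  # default loading of 4 556 skills (from the table)


def reduction_factor(router_tokens: float) -> float:
    """How many times smaller the router prompt is than the baseline."""
    return BASELINE_TOKENS / router_tokens


def tokens_for_factor(factor: float) -> float:
    """Per-turn token count implied by a given reduction factor."""
    return BASELINE_TOKENS / factor


print(reduction_factor(2_000))  # lower bound implied by "< 2 K": 114.0
print(tokens_for_factor(456))   # per-turn tokens implied by 456x: 500.0
```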


When the Router Misses

The SQL‑query case illustrates the limitation: the exact sql‑optimization‑patterns skill existed in the full corpus but was not part of the indexed 686‑skill sample. The router returned a plausible but wrong skill (food-database-query). For this query, the miss came from index coverage rather than from the embedding model: a router cannot return a skill that was never indexed. The wrong answer also scored 0.85, within the same 0.85‑0.88 range as every other top‑1 result, so the similarity score alone does not flag the miss.
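A coverage check makes this kind of gap visible before queries are run. The sketch below is a plain set difference between the skill names in the corpus and the names that were actually indexed. The function name is hypothetical, and the three example names are the ones mentioned in the article.

```python
def missing_from_index(corpus_names: list[str], indexed_names: list[str]) -> list[str]:
    """Skills present in the corpus but absent from the vector index."""
    return sorted(set(corpus_names) - set(indexed_names))


# Small example using skill names that appear in the article.
corpus = ["sql-optimization-patterns", "food-database-query", "gitlab-ci-patterns"]
indexed = ["food-database-query", "gitlab-ci-patterns"]
print(missing_from_index(corpus, indexed))

# Overall coverage in the experiment: 686 indexed out of 4 556 files.
coverage_pct = 100 * 686 / 4556
print(f"index coverage: {coverage_pct:.1f}%")
```

At roughly 15 % coverage, misses of this kind are expected for any query whose best skill falls in the unindexed remainder.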


Operational Pain Points

  • Embedding stalls – 86 of 772 documents (≈11 %) got stuck in the “pending” state due to a worker‑error limit.
  • Bulk‑ingest loss – ~20 % of submitted documents never appeared, likely filtered out for being empty or malformed.
  • Worker concurrency – Raising NUM_WORKERS from the default 3 to 10 increased throughput from ~9 to ~38 docs/min without degrading quality.
  • Query timeouts – One out of eight queries timed out on a cold cache; rerunning succeeded.

These are engineering‑level issues that can be mitigated with better monitoring and by restarting the mesh‑memory workers when they hit the error threshold.
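Since the cold‑cache timeout went away on a rerun, a small retry wrapper is one way to absorb it. The sketch below is generic: `with_retry` is a hypothetical helper, and it assumes the search call raises Python's built‑in `TimeoutError` on timeout, which may not match the actual client.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(call: Callable[[], T], attempts: int = 3, base_delay: float = 0.5) -> T:
    """Run `call`, retrying on TimeoutError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the timeout
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")
```

A call site would wrap the search in a zero‑argument function, for example `with_retry(lambda: search(query))`, where `search` stands for whatever client function performs the vector query.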


Reproducing the Test

All code and data are open source.

To try it yourself, drop the skill folder into a running Mesh instance, run the Python script (≈70 lines), and observe the top‑5 results for any custom task description.


Takeaways for Practitioners

  1. Below ~30 skills – simple eager loading is fine; the token overhead is negligible.
  2. 100‑1 000 skills – start using a semantic router to keep the prompt under control.
  3. >1 000 skills – the router becomes the only viable strategy; token savings exceed 100× and the model’s attention stays focused.
  4. Maintain index freshness – periodic re‑embedding and deduplication are required to avoid stale or duplicate entries.
  5. Monitor embedding workers – the “MAX_CONSECUTIVE_ERRORS” limit can silently drop documents; a watchdog that restarts workers on stall is recommended.
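The watchdog in point 5 needs some way to decide that embedding has stalled. One simple rule, sketched below, is to poll the number of pending documents and flag a stall when that count is non‑zero and unchanged over the last few polls. The function name and the `window` parameter are assumptions; how the pending count is obtained and how workers are restarted depends on the deployment and is left out.

```python
def is_stalled(pending_counts: list[int], window: int = 3) -> bool:
    """True if the pending count is non-zero and identical over the last `window` polls."""
    if window < 2 or len(pending_counts) < window:
        return False  # not enough history to judge
    recent = pending_counts[-window:]
    return recent[-1] > 0 and min(recent) == max(recent)


# Example: the queue drains from 120 to 86 and then stops moving.
history = [120, 86, 86, 86]
if is_stalled(history):
    print("embedding appears stalled; restart workers")
```

Requiring several identical readings avoids restarting workers that are merely slow, at the cost of a delay of `window` polling intervals before a stall is noticed.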

Outlook

The experiment supports the core hypothesis: a lightweight vector search can stand in for the eagerly loaded name‑and‑description catalog while still surfacing a usable skill in the top‑5 for most queries, at least on this small eight‑query test. As skill ecosystems grow into the tens of thousands, the pattern will likely become a standard component of any Claude‑based or similar LLM‑agent platform.

Future work could explore:

  • Hybrid retrieval (semantic + BM25) for rare or highly technical skills.
  • Multi‑turn routing where the agent refines its query based on intermediate results.
  • Automatic detection of “missing” skills and on‑the‑fly embedding of new markdown files.
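For the hybrid‑retrieval idea, one common way to merge a semantic ranking with a BM25 ranking is reciprocal rank fusion (RRF). The article does not say which fusion method would be used, so the sketch below is only an illustration of that technique. The function name is hypothetical, and `k = 60` is a conventional default for RRF rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Fuse several ranked lists: each item scores sum(1 / (k + rank))."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, name in enumerate(ranking, start=1):
            scores[name] = scores.get(name, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda name: (-scores[name], name))[:top_n]


# Toy example with made-up skill names.
semantic = ["a", "b", "c"]   # ranking from vector search
lexical = ["c", "a", "d"]    # ranking from a keyword search such as BM25
print(reciprocal_rank_fusion([semantic, lexical]))
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem of cosine similarities and BM25 scores living on different scales.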

Dmytro Klymentiev is an AI infrastructure engineer focused on multi‑agent orchestration and developer productivity tools. Follow him on X @DmytroKlymentiev_2hlgbbuo.
