An empirical test of the “skills as semantic router” pattern shows that indexing a community‑wide skill catalog into a vector store reduces Claude Code’s prompt size from ~228 K tokens to under 2 K per task, while still returning the right skill in the top‑5 for 87.5% of queries.
Introduction
When Claude Code agents load a skill catalog, they traditionally dump every skill’s name and description into the system prompt. For a community collection of 4,556 skills that means roughly 228 000 tokens, far exceeding the 200 K token window of Claude Sonnet. The result is either a prompt that won’t fit or a degraded attention pattern that leads to mis‑picks.
The semantic router pattern separates the catalog from the prompt: each skill is stored once in an embedding index, and at task time the agent performs a single vector search to retrieve the most relevant candidates. Dmytro Klymentiev ran a controlled experiment to see whether this approach actually works in practice.

Why Progressive Disclosure Falls Short
Anthropic’s “progressive disclosure” loads skill bodies only on demand, but the name and description of every skill are still read at startup, so the token cost scales linearly with catalog size:
| Skills | Approx. tokens (names+descriptions) | Share of 200 K window |
|---|---|---|
| 100 | ~5 K | 2.5 % |
| 1 000 | ~50 K | 25 % |
| 4 000 | ~200 K | 100 % (overflow) |
Beyond the raw token budget, a long list of similar items reduces the model’s ability to focus, and there is no built‑in garbage collection – stale or duplicate skills accumulate indefinitely.
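The linear scaling can be checked with a few lines of arithmetic. The sketch below assumes an average of about 50 tokens per name-plus-description entry, which is what the article's figure of ~228 K tokens for 4 556 skills implies; the constant and function names are illustrative, not taken from the experiment's code.

```python
TOKENS_PER_ENTRY = 50      # assumption: ~228_000 tokens / 4_556 skills
CONTEXT_WINDOW = 200_000   # window size used in the article's tables


def catalog_cost(num_skills: int) -> tuple[int, float]:
    """Return (tokens, share of window) for eagerly loaded names + descriptions."""
    tokens = num_skills * TOKENS_PER_ENTRY
    return tokens, tokens / CONTEXT_WINDOW


for n in (1_000, 4_000, 4_556):
    tokens, share = catalog_cost(n)
    print(f"{n:>5} skills: ~{tokens / 1000:.0f} K tokens, {share:.0%} of the window")
```

Any share at or above 100 % means the catalog alone fills the window before the task itself is stated.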
The Semantic Router Experiment
Corpus & Indexing
- Source: antigravity-awesome‑skills, a public repository of Anthropic‑format skill markdown files.
- Size: 4 556 SKILL.md files, deduplicated by directory.
- Indexed subset: After bulk‑ingest failures, 686 skills were successfully embedded.
- Embedding model: `intfloat/multilingual-e5-base` via sentence‑transformers (768‑dimensional vectors). Stored in PostgreSQL with the `pgvector` extension.
- Index payload: `name + "\n\n" + description` (≈50‑200 tokens per entry).
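As a rough illustration of the index payload, the sketch below builds `name + "\n\n" + description` from a skill file. It assumes each SKILL.md starts with a simple frontmatter block containing `name` and `description` keys; the parser, the function names, and the sample file are hypothetical, and in the experiment the ingestion was handled by the Mesh‑memory server rather than by code like this.

```python
from typing import Dict, Optional


def parse_frontmatter(markdown: str) -> Dict[str, str]:
    """Read flat `key: value` pairs from a leading `---` block (no nested YAML)."""
    lines = markdown.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":          # end of the frontmatter block
            break
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip().strip("\"'")
    return fields


def build_payload(markdown: str) -> Optional[str]:
    """Return the text to embed, or None when name or description is missing."""
    fields = parse_frontmatter(markdown)
    name, description = fields.get("name"), fields.get("description")
    if not name or not description:
        return None                        # empty or malformed file: skip it
    return name + "\n\n" + description


# Hypothetical skill file, for illustration only.
sample = """---
name: example-skill
description: Hypothetical description used only for illustration.
---
# The body below the frontmatter is not part of the index payload
"""
print(repr(build_payload(sample)))
```

Returning `None` for incomplete files makes skipped documents explicit, so they can be counted instead of disappearing during ingestion.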
Query Set
Eight task descriptions were crafted before looking at the corpus, covering common developer workflows:
- deploy docker production
- analyze stock market data
- write marketing email
- optimize slow SQL query
- security audit web app
- set up CI/CD pipeline python
- debug memory leak C++
- build React TypeScript component
For each query the router returned the top‑5 most similar skills (cosine similarity) and the results were manually judged.
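The ranking step can be sketched as a brute-force cosine-similarity search. This is a minimal illustration with toy 3-dimensional vectors and made-up skill names; in the experiment the vectors are 768-dimensional and the search runs inside pgvector, so the functions below are stand-ins rather than the author's code.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], index: Dict[str, List[float]], k: int = 5) -> List[Tuple[str, float]]:
    """Return the k (skill name, similarity) pairs most similar to the query."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index (hypothetical names and vectors).
toy_index = {
    "skill-a": [1.0, 0.0, 0.0],
    "skill-b": [0.8, 0.6, 0.0],
    "skill-c": [0.0, 1.0, 0.0],
    "skill-d": [0.0, 0.0, 1.0],
}
for name, score in top_k([1.0, 0.1, 0.0], toy_index, k=2):
    print(f"{name}: {score:.2f}")
```

Only the names and descriptions of the returned candidates need to enter the prompt, which is where the token savings come from.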
Metrics
| Metric | Definition |
|---|---|
| Strict top‑1 | First result is the exact skill a human would pick. |
| Loose top‑1 | First result is in the right family but not a perfect match. |
| Top‑5 cluster | At least one of the five results is a strong match the agent could use. |
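One way to turn manual judgments into the three metrics is sketched below. The `Judgment` record and its labels are assumptions made for illustration; the article does not describe how the judgments were stored, and the sample data is hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Judgment:
    top1: str              # "exact", "family", or "miss" for the first result
    strong_in_top5: bool   # True if any of the five results is a strong match


def summarize(judgments: List[Judgment]) -> Dict[str, float]:
    """Compute strict top-1, loose top-1, and top-5 cluster accuracy."""
    if not judgments:
        raise ValueError("need at least one judged query")
    n = len(judgments)
    return {
        "strict_top1": sum(j.top1 == "exact" for j in judgments) / n,
        "loose_top1": sum(j.top1 == "family" for j in judgments) / n,
        "top5_cluster": sum(j.strong_in_top5 for j in judgments) / n,
    }


# Hypothetical judgments for four queries, for illustration only.
sample_judgments = [
    Judgment("exact", True),
    Judgment("family", True),
    Judgment("miss", True),
    Judgment("miss", False),
]
print(summarize(sample_judgments))
```

Keeping strict and loose top‑1 as separate labels avoids counting a right-family result as an exact hit.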
Results
| Query | Top‑1 result (skill, similarity) | Top‑5 verdict |
|---|---|---|
| deploy docker production | azd-deployment (0.86) | YES (3/5 are deploy‑related) |
| analyze stock market data | xvary-stock-research (0.87) | YES (relevant at #4) |
| write marketing email | copywriting (0.86) | YES (blog‑writing, writer) |
| optimize slow SQL query | food-database-query (0.85) | NO (no true SQL‑tuning skill) |
| security audit web app | laravel-security-audit (0.88) | YES (aws‑security, burp‑suite, etc.) |
| set up CI/CD pipeline python | gitlab-ci-patterns (0.87) | YES (circleci‑automation at #2) |
| debug memory leak C++ | c-pro (0.86) | YES (gdb‑cli, systematic‑debugging) |
| build React TypeScript component | react-flow-node-ts (0.88) | YES (5/5 frontend‑relevant) |
Strict top‑1 accuracy: 5 / 8 = 62.5 %
Top‑5 cluster accuracy: 7 / 8 = 87.5 %
Latency: sub‑second per query on a single‑CPU container.
Scaling Curve
The test was repeated after each batch of 100 newly indexed skills. The top‑5 cluster metric plateaued around 85 % after 500 skills, while strict top‑1 kept climbing, reaching 62.5 % only after the final 686‑skill batch. With only eight queries this is indicative rather than conclusive, but it suggests that the router quickly finds a relevant family of skills and that additional indexing mainly improves the chance of hitting the exact desired skill.
Token Savings
| Approach | Tokens per turn | % of 200 K window |
|---|---|---|
| Default loading (4 556 skills) | ~228 K | >100 % (won’t fit) |
| Semantic router (top‑5 + body) | < 2 K | ~1 % |
The router therefore cuts context usage by at least ~114× (228 K / 2 K) for the same task; the ~456× figure corresponds to a turn that uses only about 500 tokens. Either way, nearly all of the context window is freed for actual work.
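The reduction factor follows directly from the two table rows and depends on how many tokens a routed turn actually uses. A quick check using only the article's figures (the helper name is illustrative):

```python
def reduction_factor(baseline_tokens: int, router_tokens: int) -> float:
    """How many times smaller the routed prompt is than the eager catalog."""
    return baseline_tokens / router_tokens


BASELINE = 228_000  # default loading of 4 556 skills

print(reduction_factor(BASELINE, 2_000))  # 114.0, router at its 2 K upper bound
print(BASELINE / 456)                     # 500.0, tokens per turn implied by 456x
```

Because "< 2 K" is an upper bound, 114× is the conservative figure; larger factors apply to turns that retrieve shorter skills.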
When the Router Misses
The SQL‑query case illustrates the limitation: the exact sql‑optimization‑patterns skill existed in the full corpus but was not part of the indexed 686‑skill sample, so the router returned a plausible but wrong skill (food-database-query). The miss came from index coverage rather than from the embedding model: the router cannot return a skill that was never indexed, so its accuracy is bounded by how complete the indexed subset is.
Operational Pain Points
- Embedding stalls – 86 of 772 documents (≈11 %) got stuck in the “pending” state due to a worker‑error limit.
- Bulk‑ingest loss – ~20 % of submitted documents never appeared, likely filtered out for being empty or malformed.
- Worker concurrency – Raising `NUM_WORKERS` from the default 3 to 10 raised throughput from ~9 to 38 docs/min without degrading quality.
- Query timeouts – One out of eight queries timed out on a cold cache; rerunning succeeded.
These are engineering‑level issues that can be mitigated with better monitoring and by restarting the mesh‑memory workers when they hit the error threshold.
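Since the cold-cache timeout went away on a rerun, a small retry wrapper is one possible mitigation on the client side. The sketch below assumes the search call raises Python's built-in `TimeoutError`; the real client's exception type is not described in the article, and `flaky_search` is a hypothetical stand-in for the router query.

```python
import time
from typing import Callable, List


def query_with_retry(
    run_query: Callable[[str], List[str]],
    task: str,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Call run_query(task); on TimeoutError, back off and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return run_query(task)
        except TimeoutError:
            if attempt == attempts:
                raise                                # out of retries: surface the error
            sleep(base_delay * 2 ** (attempt - 1))   # 0.5 s, 1 s, 2 s, ...
    raise ValueError("attempts must be at least 1")


calls = {"count": 0}


def flaky_search(task: str) -> List[str]:
    """Hypothetical search that times out once, like a cold cache."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("cold cache")
    return ["skill-a", "skill-b"]


print(query_with_retry(flaky_search, "optimize slow SQL query", sleep=lambda s: None))
```

Passing `sleep` as a parameter keeps the wrapper testable without real delays.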
Reproducing the Test
All code and data are open source:
- Runner & queries – https://github.com/dmytrokl/semantic‑router‑test
- Skill corpus – https://github.com/antigravity‑awesome‑skills
- Mesh‑memory server – https://github.com/antigravity‑awesome/mesh‑memory (MIT‑licensed)
- Embedding script – uses `sentence‑transformers` and `pgvector`.
To try it yourself, drop the skill folder into a running Mesh instance, run the Python script (≈70 lines), and observe the top‑5 results for any custom task description.
Takeaways for Practitioners
- Below ~30 skills – simple eager loading is fine; the token overhead is negligible.
- 100‑1 000 skills – start using a semantic router to keep the prompt under control.
- >1 000 skills – the router becomes the only viable strategy; token savings exceed 100× and the model’s attention stays focused.
- Maintain index freshness – periodic re‑embedding and deduplication are required to avoid stale or duplicate entries.
- Monitor embedding workers – the “MAX_CONSECUTIVE_ERRORS” limit can silently drop documents; a watchdog that restarts workers on stall is recommended.
Outlook
The experiment validates the core hypothesis: a lightweight vector search can replace the naïve progressive‑disclosure catalog without sacrificing relevance. As skill ecosystems grow into the tens of thousands, the pattern will likely become a standard component of any Claude‑based or similar LLM‑agent platform.
Future work could explore:
- Hybrid retrieval (semantic + BM25) for rare or highly technical skills.
- Multi‑turn routing where the agent refines its query based on intermediate results.
- Automatic detection of “missing” skills and on‑the‑fly embedding of new markdown files.
Dmytro Klymentiev is an AI infrastructure engineer focused on multi‑agent orchestration and developer productivity tools. Follow him on X @DmytroKlymentiev_2hlgbbuo.
