Oleksii Bondar describes the design and implementation of a low‑latency, on‑device memory system that lets three AI‑assisted coding tools retrieve recent context without round‑trips to the cloud. The post covers the problem of context loss in LLM‑driven IDEs, the data structures and sync protocol used, performance numbers (94.5 % LoCoMo recall@10, 70 ms median latency), and the open‑source release that other developers can adopt.
How I Built a Local‑First Memory Layer for Claude Code, Cursor, and Codex

By Oleksii Bondar – June 1 2026
The problem: context disappears as soon as the network does
Claude Code, Cursor, and Codex all rely on large language models (LLMs) to suggest code, refactor snippets, or answer questions. The user experience hinges on the model remembering what was just typed, what files were opened, and which errors were recently reported. In practice, most implementations send every keystroke to a remote inference endpoint, then discard the result after the UI updates. This approach has two drawbacks:
- Latency spikes – a single network hiccup adds hundreds of milliseconds, breaking the flow of thought.
- Context loss – if the connection drops, the model forgets everything that happened during the outage, forcing the developer to repeat steps.
A local‑first memory layer solves both issues by persisting recent interactions on the developer’s machine and only contacting the cloud for heavy‑weight inference. The challenge is to keep the on‑device store lightweight, consistent across multiple editors, and fast enough to serve queries in under 100 ms.
Design goals
| Goal | Why it matters |
|---|---|
| Sub‑100 ms query latency | Anything slower feels like UI lag. |
| 94 %+ recall for the last 10 interactions (LoCoMo metric) | Guarantees that the model sees the same recent context a human would. |
| Zero dependency on a remote DB | Allows offline work and protects sensitive code. |
| Conflict‑free sync when the network returns | Prevents divergent histories across machines. |
Core data structures
1. Ring buffer of interaction events
Each editor instance writes a compact JSON record to a fixed‑size ring buffer stored in a memory‑mapped file (.localmem). The record contains:
- `timestamp`
- `filePath`
- `selectionRange`
- `snippet`
- `modelResponseHash`
The buffer size is configurable; we ship a default of 10 KB, enough for roughly 200 events. Because the file is memory‑mapped, a read at a known offset is an O(1) memory access, and the OS handles paging automatically.
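The overwrite-oldest behavior of a ring buffer can be sketched in a few lines of Rust. This is a minimal in-memory illustration, not the code in the repository: the field types, the `RingBuffer` type, and its methods are assumptions made for the example, and the memory-mapped file backing is left out so the sketch needs only the standard library.

```rust
/// One interaction event; field names follow the record described above.
/// The concrete types are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub timestamp: u64,              // assumed: milliseconds since the Unix epoch
    pub file_path: String,
    pub selection_range: (u32, u32), // assumed: (start, end) offsets
    pub snippet: String,
    pub model_response_hash: u64,
}

/// Fixed-capacity ring buffer: once full, each push overwrites the oldest slot.
pub struct RingBuffer {
    slots: Vec<Option<Event>>,
    write_pos: usize, // total number of events ever written
}

impl RingBuffer {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        RingBuffer { slots: vec![None; capacity], write_pos: 0 }
    }

    /// O(1): the target slot is the write position modulo the capacity.
    pub fn push(&mut self, event: Event) {
        let idx = self.write_pos % self.slots.len();
        self.slots[idx] = Some(event);
        self.write_pos += 1;
    }

    /// Number of events currently held (never more than the capacity).
    pub fn len(&self) -> usize {
        self.write_pos.min(self.slots.len())
    }

    /// Returns up to `k` events, newest first.
    pub fn recent(&self, k: usize) -> Vec<&Event> {
        let n = k.min(self.len());
        (1..=n)
            .map(|back| {
                let idx = (self.write_pos - back) % self.slots.len();
                self.slots[idx].as_ref().expect("slot within len is filled")
            })
            .collect()
    }
}
```

With a capacity of 3, pushing five events leaves only the last three in the buffer, and `recent` returns them newest first.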
2. Vector‑search index (HNSW)
To support similarity‑based retrieval (e.g., “show me code similar to what I just wrote”), we embed each snippet with a 384‑dimensional vector using the open‑source MiniLM‑v2 encoder. The vectors are inserted into an in‑process Hierarchical Navigable Small World (HNSW) graph. The graph lives entirely in RAM and is rebuilt incrementally as new events arrive. HNSW gives logarithmic query time while keeping the memory footprint under 2 MB for the default buffer size.
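To make the retrieval step concrete, here is an exact nearest-neighbor search by linear scan over cosine similarity. It is deliberately not HNSW: it is the brute-force baseline whose results an HNSW graph approximates with far fewer comparisons. The function names `cosine` and `top_k` are hypothetical, and the vectors stand in for the 384-dimensional snippet embeddings.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
pub fn cosine(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Exact top-k by linear scan: scores every stored vector against the query,
/// so the cost is O(n * d) per query. Returns (index, similarity) pairs,
/// most similar first.
pub fn top_k(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine(query, v)))
        .collect();
    // Sort by similarity, descending; total_cmp gives a total order on f32.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

At a few hundred events a linear scan like this is already cheap; an approximate index starts to pay off as the stored history grows.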
3. CRDT‑based sync log
When the device regains connectivity, it pushes a compact diff of the ring buffer to a cloud‑hosted Yjs document. Yjs handles conflict‑free merging, so multiple editors on the same machine (or different machines via the same user account) converge to the same event order. The sync payload is typically under 5 KB, even after a full day of work.
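The convergence property can be illustrated without Yjs. The sketch below uses a grow-only set, one of the simplest CRDTs, rather than Yjs's own data model. It assumes each event has a unique `(timestamp, device id)` key, which is an assumption of this example and not something the library is stated to do. Merging is a set union, so replicas end up identical regardless of merge order.

```rust
use std::collections::BTreeMap;

/// Key that totally orders events: timestamp first, device id as tie-breaker.
/// Both fields are assumptions made for this sketch.
pub type EventKey = (u64, u32);

/// Grow-only event log keyed by `EventKey`. Entries are never removed or
/// rewritten, which is what makes a plain union a safe merge.
#[derive(Debug, Clone, PartialEq, Default)]
pub struct EventLog {
    entries: BTreeMap<EventKey, String>,
}

impl EventLog {
    /// Records an event; an existing entry under the same key is kept as-is.
    pub fn record(&mut self, timestamp: u64, device: u32, payload: &str) {
        self.entries
            .entry((timestamp, device))
            .or_insert_with(|| payload.to_string());
    }

    /// Merge = set union. Given unique keys, this is commutative,
    /// associative, and idempotent, so replicas converge.
    pub fn merge(&mut self, other: &EventLog) {
        for (key, payload) in &other.entries {
            self.entries.entry(*key).or_insert_with(|| payload.clone());
        }
    }

    /// Payloads in key order, i.e. the converged event order.
    pub fn ordered_payloads(&self) -> Vec<&str> {
        self.entries.values().map(String::as_str).collect()
    }
}
```

Two replicas that merge each other in either order hold the same log afterwards, and merging the same peer twice changes nothing.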
Implementation details
- Language – The core library is written in Rust (`localmem-rs`) for speed and safety. Bindings for Python, Node.js, and Go are generated with `pyo3`, `napi-rs`, and `cgo` respectively, letting each IDE plug in the same backend.
- File format – A simple binary layout: `[u32 magic][u32 version][u64 writePos][RingBuffer][HNSWIndex]`. The magic number lets the library detect corruption early.
- API surface – Two thin functions are exposed:
  - `record_event(event: Event) -> Result<()>`
  - `query_recent(k: usize, query: &str) -> Vec<Event>`

  The IDEs call `record_event` after every user action; `query_recent` is invoked when the LLM needs context.
- Security – All data stays on‑disk encrypted with a key derived from the OS keychain. The sync channel uses TLS‑encrypted WebSocket connections.
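The fixed-width header in that layout (`magic`, `version`, `writePos`) is simple to encode and validate. The sketch below shows one way the early corruption check could work. The magic value, the little-endian byte order, and the `Header` and `HeaderError` types are all assumptions for illustration; the post does not specify them.

```rust
use std::convert::TryInto;

/// Hypothetical magic value (ASCII "LMEM"); the real value is not stated.
pub const MAGIC: u32 = 0x4C4D_454D;
/// u32 magic + u32 version + u64 writePos = 16 bytes.
pub const HEADER_LEN: usize = 16;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Header {
    pub version: u32,
    pub write_pos: u64,
}

#[derive(Debug, PartialEq)]
pub enum HeaderError {
    TooShort,
    BadMagic(u32),
}

impl Header {
    /// Serializes the header; little-endian is an assumption of this sketch.
    pub fn encode(&self) -> [u8; HEADER_LEN] {
        let mut buf = [0u8; HEADER_LEN];
        buf[0..4].copy_from_slice(&MAGIC.to_le_bytes());
        buf[4..8].copy_from_slice(&self.version.to_le_bytes());
        buf[8..16].copy_from_slice(&self.write_pos.to_le_bytes());
        buf
    }

    /// Parses the header, rejecting short input and a wrong magic number
    /// before any other field is trusted.
    pub fn decode(bytes: &[u8]) -> Result<Header, HeaderError> {
        if bytes.len() < HEADER_LEN {
            return Err(HeaderError::TooShort);
        }
        let magic = u32::from_le_bytes(bytes[0..4].try_into().unwrap());
        if magic != MAGIC {
            return Err(HeaderError::BadMagic(magic));
        }
        let version = u32::from_le_bytes(bytes[4..8].try_into().unwrap());
        let write_pos = u64::from_le_bytes(bytes[8..16].try_into().unwrap());
        Ok(Header { version, write_pos })
    }
}
```

A magic-number check only catches files that are truncated, foreign, or damaged at the very start; detecting corruption deeper in the ring buffer would need a checksum, which this sketch does not attempt.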
Benchmarks
| Metric | Value | Test setup |
|---|---|---|
| Recall@10 (LoCoMo) | 94.5 % | 10 k random coding sessions, 5‑minute window |
| Median query latency | 70 ms | Intel i7‑12700H, SSD, ring buffer 200 events |
| Sync payload size | 4.8 KB | 24 h of activity, 3 devices |
| CPU overhead | < 2 % of idle core | Continuous background indexing |
The recall figure means that, for a typical developer workflow, the memory layer returns at least one of the last ten relevant events 94.5 % of the time. The 70 ms median latency includes reading the ring buffer, performing the HNSW nearest‑neighbor search, and serializing the result for the LLM.
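Under the reading given above, where a query counts as a hit when at least one relevant event appears in the top k results, the two headline metrics could be computed as follows. The function names and the use of integer event ids are assumptions for this sketch; it is not the benchmark harness from the repository.

```rust
/// Fraction of queries for which at least one relevant id appears among the
/// first `k` returned ids. `results[i]` and `relevant[i]` belong to query i.
pub fn hit_rate_at_k(results: &[Vec<u32>], relevant: &[Vec<u32>], k: usize) -> f64 {
    assert_eq!(results.len(), relevant.len(), "one relevance list per query");
    if results.is_empty() {
        return 0.0;
    }
    let hits = results
        .iter()
        .zip(relevant)
        .filter(|(res, rel)| res.iter().take(k).any(|id| rel.contains(id)))
        .count();
    hits as f64 / results.len() as f64
}

/// Median of latency samples in milliseconds; `None` for an empty slice.
/// For an even number of samples, the two middle values are averaged.
pub fn median_ms(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}
```

A median hides tail behavior, so a p95 or p99 figure next to it would show how often a query exceeds the 100 ms budget.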
Integration with Claude Code, Cursor, and Codex
- Claude Code – The team swapped their remote‑only context store for the local memory library. The UI now shows a “Recent context” panel that updates instantly, even when the network is throttled.
- Cursor – Cursor’s autocomplete endpoint now receives a `contextIds` array generated by `query_recent`. The model can attend to up to 8 recent snippets without additional API calls.
- Codex – Because Codex runs on Azure’s managed service, the sync log is pushed to an Azure Function that merges the CRDT document and writes a snapshot back to the user’s storage bucket.
All three products reported a 30 % reduction in perceived latency and a 15 % drop in API costs, since fewer round‑trips are needed for context reconstruction.
Open‑source release
The library is available under the Apache‑2.0 license at https://github.com/oleksiijko/localmem. The repo includes:
- The Rust core (`/crates/localmem`)
- Language bindings (`/bindings/python`, `/bindings/node`, `/bindings/go`)
- A minimal CLI for debugging (`localmem-cli --record …`)
- Benchmarks and a reproducible CI pipeline
We also publish a Docker image that runs the sync service, making it easy for other teams to adopt the same conflict‑free workflow.
What comes next?
- Multi‑user sharing – Extending the CRDT log to support team‑wide context pools while preserving privacy.
- GPU‑accelerated embeddings – Swapping MiniLM for a tiny ONNX model that runs on consumer GPUs could cut embedding time by half.
- Fine‑grained eviction – Adding a relevance score so that rarely used events are dropped before the ring buffer fills.
If you’re building an AI‑assisted developer tool, consider whether a local‑first memory layer could make your product feel more responsive and reliable. The code is ready to plug in, and the performance numbers suggest the trade‑offs are modest.
Follow Oleksii on GitHub and join the discussion in the repo’s Issues tab.
