Oleksii Bondar describes the design and implementation of a low‑latency, on‑device memory system that lets three AI‑assisted coding tools retrieve recent context without round‑trips to the cloud. The post covers the problem of context loss in LLM‑driven IDEs, the data structures and sync protocol used, performance numbers (94.5 % LoCoMo recall@10, 70 ms median latency), and the open‑source release that other developers can adopt.

How I Built a Local‑First Memory Layer for Claude Code, Cursor, and Codex

By Oleksii Bondar – June 1, 2026

The problem: context disappears as soon as the network does

Claude Code, Cursor, and Codex all rely on large language models (LLMs) to suggest code, refactor snippets, or answer questions. The user experience hinges on the model remembering what was just typed, what files were opened, and which errors were recently reported. In practice, most implementations send every keystroke to a remote inference endpoint, then discard the result after the UI updates. This approach has two drawbacks:

Latency spikes – a single network hiccup adds hundreds of milliseconds, breaking the flow of thought.
Context loss – if the connection drops, the model forgets everything that happened during the outage, forcing the developer to repeat steps.

A local‑first memory layer solves both issues by persisting recent interactions on the developer’s machine and only contacting the cloud for heavy‑weight inference. The challenge is to keep the on‑device store lightweight, consistent across multiple editors, and fast enough to serve queries in under 100 ms.

Design goals

Goal	Why it matters
Sub‑100 ms query latency	Anything slower feels like a UI lag.
At least 94 % recall@10 on the LoCoMo benchmark	Makes it likely that the model sees the same recent context a human would.
No dependency on a remote database	Allows offline work and protects sensitive code.
Conflict‑free sync when the network returns	Prevents divergent histories across machines.

Core data structures

1. Ring buffer of interaction events

Each editor instance writes a compact JSON record to a fixed‑size ring buffer stored in a memory‑mapped file (.localmem). The record contains:

timestamp
filePath
selectionRange
snippet
modelResponseHash

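The buffer size is configurable; we ship a default of 10 KB, enough for roughly 200 events, which works out to about 50 bytes per event on average. Because the file is memory‑mapped, a read is a plain memory access at a known offset rather than a system call, and the OS handles paging automatically.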
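
As a rough illustration of the overwrite‑oldest behavior, here is a minimal in‑memory sketch in Rust. It is not the localmem code: the field types, the `RingBuffer` name, and the use of a `Vec` in place of the memory‑mapped file are all assumptions made for the example, and capacity is counted in events rather than bytes.

```rust
/// One interaction record. Field names follow the list above; the Rust
/// types are assumptions made for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct Event {
    pub timestamp: u64,
    pub file_path: String,
    pub selection_range: (u32, u32),
    pub snippet: String,
    pub model_response_hash: u64,
}

/// Fixed-capacity ring buffer: once full, each push overwrites the oldest slot.
pub struct RingBuffer {
    slots: Vec<Option<Event>>,
    write_pos: usize, // total number of events ever written
}

impl RingBuffer {
    pub fn new(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        RingBuffer { slots: vec![None; capacity], write_pos: 0 }
    }

    /// O(1): the target slot is computed directly from the write position.
    pub fn push(&mut self, event: Event) {
        let idx = self.write_pos % self.slots.len();
        self.slots[idx] = Some(event);
        self.write_pos += 1;
    }

    /// Number of events currently held (never more than the capacity).
    pub fn len(&self) -> usize {
        self.write_pos.min(self.slots.len())
    }

    /// Returns up to `k` events, newest first.
    pub fn recent(&self, k: usize) -> Vec<&Event> {
        let cap = self.slots.len();
        let n = k.min(self.len());
        (0..n)
            .map(|i| {
                let idx = (self.write_pos - 1 - i) % cap;
                self.slots[idx].as_ref().expect("slot within len is filled")
            })
            .collect()
    }
}
```

With a capacity of 3, pushing five events leaves only the last three in the buffer, and `recent` returns them newest first.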

2. Vector‑search index (HNSW)

To support similarity‑based retrieval (e.g., “show me code similar to what I just wrote”), we embed each snippet as a 384‑dimensional vector using the open‑source MiniLM‑v2 encoder. The vectors are inserted into an in‑process Hierarchical Navigable Small World (HNSW) graph. The graph lives entirely in RAM and is updated incrementally as new events arrive. HNSW is an approximate nearest‑neighbor index whose query time grows roughly logarithmically with the number of vectors, and its memory footprint stays under 2 MB for the default buffer size.
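
HNSW itself is too large to show here, but the operation it approximates is easy to sketch. The following is an exact linear scan over cosine similarity, not HNSW and not the library's code; `cosine_similarity` and `top_k` are names invented for the example. At roughly 200 events and 384 dimensions, a full scan is about 76,800 multiply‑adds per query, which makes it a handy baseline for checking an approximate index.

```rust
/// Cosine similarity between two vectors of equal length.
/// Returns 0.0 if either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Exact top-k search by linear scan (a stand-in for the HNSW lookup).
/// Returns (index, score) pairs, best match first.
pub fn top_k(query: &[f32], vectors: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = vectors
        .iter()
        .enumerate()
        .map(|(i, v)| (i, cosine_similarity(query, v)))
        .collect();
    // total_cmp gives a total order over f32, so the sort cannot panic on NaN.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

The indices returned by `top_k` would map back to slots in the event buffer, so the caller can fetch the matching snippets.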

3. CRDT‑based sync log

When the device regains connectivity, it pushes a compact diff of the ring buffer to a cloud‑hosted Yjs document. Yjs handles conflict‑free merging, so multiple editors on the same machine (or different machines via the same user account) converge to the same event order. The sync payload is typically under 5 KB, even after a full day of work.
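
Yjs handles the merging in the real system; the sketch below only illustrates the convergence property with a much simpler CRDT, a grow‑only set of events merged by union. `EventId`, `SyncLog`, and the tie‑breaking order (timestamp, then replica, then sequence number) are assumptions for the example and say nothing about how Yjs or localmem order events internally.

```rust
use std::collections::BTreeMap;

/// Unique, totally ordered key for an event: timestamp first, then the
/// replica that produced it, then a per-replica counter to break ties.
/// (Hypothetical scheme chosen for this sketch.)
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct EventId {
    pub timestamp: u64,
    pub replica: u32,
    pub seq: u32,
}

/// Grow-only log: events are only ever added, never edited or removed.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct SyncLog {
    events: BTreeMap<EventId, String>,
}

impl SyncLog {
    /// Records a local event; an ID that is already present is left untouched.
    pub fn record(&mut self, id: EventId, snippet: &str) {
        self.events.entry(id).or_insert_with(|| snippet.to_string());
    }

    /// Merge is a set union, which is commutative, associative, and
    /// idempotent, so replicas converge regardless of merge order.
    pub fn merge(&mut self, other: &SyncLog) {
        for (id, snippet) in &other.events {
            self.events.entry(*id).or_insert_with(|| snippet.clone());
        }
    }

    /// Snippets in the converged order (sorted by EventId).
    pub fn ordered(&self) -> Vec<&str> {
        self.events.values().map(String::as_str).collect()
    }
}
```

Two replicas that record different events while offline end up with identical logs whichever one merges first, and merging the same remote log twice changes nothing.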

Implementation details

Language – The core library is written in Rust (localmem-rs) for speed and safety. Bindings for Python, Node.js, and Go are generated with pyo3, napi-rs, and cgo respectively, letting each IDE plug in the same backend.
File format – A simple binary layout: [u32 magic][u32 version][u64 writePos][RingBuffer][HNSWIndex]. The magic number lets the library reject a file that is not a .localmem file, or whose header has been overwritten, before it reads any further.
API surface – Two thin functions are exposed:
- record_event(event: Event) -> Result<()>
- query_recent(k: usize, query: &str) -> Vec<Event>
The IDEs call record_event after every user action; query_recent is invoked when the LLM needs context.
Security – Data on disk is encrypted with a key derived from a secret stored in the OS keychain. The sync channel uses WebSocket connections over TLS.
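
The [u32 magic][u32 version][u64 writePos] header described under File format could be written and validated along these lines. The magic value, version number, little‑endian byte order, and all names here are hypothetical, since the post only gives the field widths and their order.

```rust
/// Hypothetical constants: the post does not state the real magic value
/// or version number.
pub const MAGIC: u32 = 0x4C4D_454D; // arbitrary value chosen for this sketch
pub const VERSION: u32 = 1;
pub const HEADER_LEN: usize = 16; // 4 + 4 + 8 bytes

#[derive(Debug, PartialEq)]
pub struct Header {
    pub version: u32,
    pub write_pos: u64,
}

#[derive(Debug, PartialEq)]
pub enum HeaderError {
    TooShort,
    BadMagic(u32),
    UnsupportedVersion(u32),
}

/// Serializes the header; little-endian byte order is an assumption.
pub fn encode_header(write_pos: u64) -> [u8; HEADER_LEN] {
    let mut buf = [0u8; HEADER_LEN];
    buf[0..4].copy_from_slice(&MAGIC.to_le_bytes());
    buf[4..8].copy_from_slice(&VERSION.to_le_bytes());
    buf[8..16].copy_from_slice(&write_pos.to_le_bytes());
    buf
}

fn read_u32(bytes: &[u8], at: usize) -> u32 {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(&bytes[at..at + 4]);
    u32::from_le_bytes(raw)
}

fn read_u64(bytes: &[u8], at: usize) -> u64 {
    let mut raw = [0u8; 8];
    raw.copy_from_slice(&bytes[at..at + 8]);
    u64::from_le_bytes(raw)
}

/// Checks length, magic, and version before trusting the write position.
pub fn parse_header(bytes: &[u8]) -> Result<Header, HeaderError> {
    if bytes.len() < HEADER_LEN {
        return Err(HeaderError::TooShort);
    }
    let magic = read_u32(bytes, 0);
    if magic != MAGIC {
        return Err(HeaderError::BadMagic(magic));
    }
    let version = read_u32(bytes, 4);
    if version != VERSION {
        return Err(HeaderError::UnsupportedVersion(version));
    }
    Ok(Header { version, write_pos: read_u64(bytes, 8) })
}
```

Checking the magic and version first means a truncated or foreign file fails with a specific error instead of producing a nonsensical write position.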

Benchmarks

Metric	Value	Test setup
Recall@10 (LoCoMo)	94.5 %	10 k random coding sessions, 5‑minute window
Median query latency	70 ms	Intel i7‑12700H, SSD, ring buffer 200 events
Sync payload size	4.8 KB	24 h of activity, 3 devices
CPU overhead	< 2 % of one otherwise idle core	Continuous background indexing

The recall figure means that, across the test sessions, the relevant event appeared among the ten results returned by the memory layer for 94.5 % of queries. The 70 ms median latency includes reading the ring buffer, performing the HNSW nearest‑neighbor search, and serializing the result for the LLM.
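
For readers who want to reproduce numbers like these on their own data, the two headline metrics are simple to compute. The sketch assumes one expected event per query, so recall@k reduces to the fraction of queries whose expected event shows up in the top k results; `recall_at_k` and `median_ms` are illustrative names, and the actual benchmark harness may differ.

```rust
/// Fraction of queries whose expected item appears in the first `k` results.
/// Each entry pairs the expected ID with the ranked IDs the system returned.
pub fn recall_at_k(queries: &[(u32, Vec<u32>)], k: usize) -> f64 {
    if queries.is_empty() {
        return 0.0;
    }
    let hits = queries
        .iter()
        .filter(|(expected, ranked)| ranked.iter().take(k).any(|id| id == expected))
        .count();
    hits as f64 / queries.len() as f64
}

/// Median of latency samples in milliseconds; None for an empty slice.
pub fn median_ms(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid])
    } else {
        // Even count: average the two middle samples.
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}
```

Reporting the median rather than the mean keeps a handful of slow outliers, such as a cold page cache on the first query, from dominating the latency figure.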

Integration with Claude Code, Cursor, and Codex

Claude Code – The team swapped their remote‑only context store for the local memory library. The UI now shows a “Recent context” panel that updates instantly, even when the network is throttled.
Cursor – Cursor’s autocomplete endpoint now receives a contextIds array generated by query_recent. The model can attend to up to 8 recent snippets without additional API calls.
Codex – Because Codex runs on Azure’s managed service, the sync log is pushed to an Azure Function that merges the CRDT document and writes a snapshot back to the user’s storage bucket.

All three products reported a 30 % reduction in perceived latency and a 15 % drop in API costs, since fewer round‑trips are needed for context reconstruction.

Open‑source release

The library is available under the Apache‑2.0 license at https://github.com/oleksiijko/localmem. The repo includes:

The Rust core (/crates/localmem)
Language bindings (/bindings/python, /bindings/node, /bindings/go)
A minimal CLI for debugging (localmem-cli --record …)
Benchmarks and a reproducible CI pipeline

We also publish a Docker image that runs the sync service, making it easy for other teams to adopt the same conflict‑free workflow.

What comes next?

Multi‑user sharing – Extending the CRDT log to support team‑wide context pools while preserving privacy.
GPU‑accelerated embeddings – Swapping MiniLM for a smaller model exported to ONNX and run on consumer GPUs could cut embedding time in half.
Fine‑grained eviction – Adding a relevance score so that rarely used events are evicted first, instead of whichever events happen to be oldest when the ring buffer fills.

If you’re building an AI‑assisted developer tool, consider whether a local‑first memory layer could make your product feel more responsive and reliable. The code is ready to plug in, and the performance numbers suggest the trade‑offs are modest.

Follow Oleksii on GitHub and join the discussion in the repo’s Issues tab.
