A deep dive into how a well-intentioned Rust implementation at Oxen led to 50-minute commit times, and the simple design change that delivered 20X performance gains.
Oxen has positioned itself as the fastest data versioning tool in the market, and with good reason. The team regularly runs multi-terabyte benchmarks on every command they support, from add to commit, pushing the boundaries of what's possible in version control performance.
When I joined the team, I was immediately drawn to this relentless pursuit of speed. Our add command was already impressive, processing 1 million files in about a minute. But something was off with commit. While add hummed along efficiently, commit was taking over 50 minutes to complete. This was puzzling because the algorithm itself is straightforward: commit is O(n), where n is the number of directories, not the number of files. Even with a million files, that means creating a relatively modest number of nodes, just one commit node plus n directory nodes.
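For intuition, here is a minimal sketch of why commit should be cheap. This is not Oxen's actual data model; the types (`DirNode`, `CommitNode`) and the grouping step are hypothetical, but the shape is the same: one pass over the staged paths to group them by parent directory, then one node per directory plus a single commit node.

```rust
use std::collections::BTreeMap;
use std::path::{Path, PathBuf};

// Hypothetical node types, for illustration only.
#[derive(Debug)]
struct DirNode {
    dir: PathBuf,
    files: Vec<PathBuf>,
}

#[derive(Debug)]
struct CommitNode {
    message: String,
    dirs: Vec<DirNode>,
}

// Group the staged file paths by parent directory: one DirNode per directory,
// plus a single CommitNode that points at all of them.
fn build_commit(message: &str, staged_files: &[PathBuf]) -> CommitNode {
    let mut by_dir: BTreeMap<PathBuf, Vec<PathBuf>> = BTreeMap::new();
    for file in staged_files {
        let dir = file.parent().unwrap_or(Path::new("")).to_path_buf();
        by_dir.entry(dir).or_default().push(file.clone());
    }
    CommitNode {
        message: message.to_string(),
        dirs: by_dir
            .into_iter()
            .map(|(dir, files)| DirNode { dir, files })
            .collect(),
    }
}
```

However many files are staged, the node-building work scales with the number of directories, which is why a 50-minute commit pointed at something other than the algorithm.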
This is where the investigation began. Following standard performance debugging protocol, I fired up samply, a remarkably convenient profiling tool that lets you visualize binary performance in a clean UI. The results were eye-opening: over 90% of the execution time was spent simply acquiring locks on the staging RocksDB.
The root cause was subtle but significant. Our parallel workers were efficiently graduating files from staged to committed by creating the appropriate directory nodes and building a complete repository view for each commit. However, because we were passing file data and metadata between different layers of our codebase, we ended up with a cascade of .clone() operations and repeated db.open() calls at various levels—all fetching the same data over and over.
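To make that concrete, here is a minimal sketch of the shape the code had drifted into. It is not Oxen's actual code; the `DirNode` type and the on-disk path are made up, and it only assumes the stock rocksdb crate. The point is that every call re-opens the same staging DB, acquiring its lock each time, and each layer boundary clones the data again.

```rust
use std::path::Path;

use rocksdb::DB;

#[derive(Clone, Debug)]
struct DirNode {
    dir: String,
    hash: String,
}

// Anti-pattern: one DB::open_default (and one lock acquisition) per lookup.
fn load_dir_node(repo: &Path, dir: &str) -> Result<Option<DirNode>, rocksdb::Error> {
    let db = DB::open_default(repo.join("staged"))?; // re-opened for a single read
    let node = db.get(dir.as_bytes())?.map(|bytes| DirNode {
        dir: dir.to_string(),
        hash: String::from_utf8_lossy(&bytes).to_string(),
    });
    Ok(node)
}

// Anti-pattern: the owned value is cloned again at the next layer boundary.
fn build_dir_nodes(repo: &Path, dirs: &[String]) -> Result<Vec<DirNode>, rocksdb::Error> {
    let mut nodes = Vec::new();
    for dir in dirs {
        if let Some(node) = load_dir_node(repo, dir)? {
            // Redundant copy; with parallel workers, the repeated opens above
            // also contend for the same RocksDB lock.
            nodes.push(node.clone());
        }
    }
    Ok(nodes)
}
```

Each layer looks reasonable in isolation; the cost only shows up when you multiply the opens and clones by a million files and a pool of workers.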
Here's where the lesson truly hit home. The fix was elegantly simple: reduce the amount of data passed between layers and cut the redundant database operations. The final pull request implementing this change looked straightforward, but the impact was dramatic: a 20X performance improvement that brought commit times down from more than 50 minutes to just a few.
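A hedged sketch of what the fix boils down to (again, not the literal pull request; the names and paths are illustrative): open the staging DB once, share the handle across workers, and pass borrowed data between layers instead of cloning it at every boundary. In real code you would use a worker pool rather than a thread per directory, but a scoped thread keeps the sketch short.

```rust
use std::path::Path;
use std::thread;

use rocksdb::DB;

// One shared handle, no per-call open, no clones across layer boundaries.
// rocksdb::DB is Send + Sync, so concurrent point reads can share it.
fn build_dir_nodes(db: &DB, dirs: &[String]) -> Result<Vec<(String, Vec<u8>)>, rocksdb::Error> {
    thread::scope(|scope| {
        let workers: Vec<_> = dirs
            .iter()
            .map(|dir| {
                scope.spawn(move || -> Result<(String, Vec<u8>), rocksdb::Error> {
                    let bytes = db.get(dir.as_bytes())?.unwrap_or_default();
                    Ok((dir.clone(), bytes))
                })
            })
            .collect();
        workers.into_iter().map(|w| w.join().unwrap()).collect()
    })
}

fn main() -> Result<(), rocksdb::Error> {
    // The staging DB is opened exactly once; the handle is passed down by
    // reference instead of being re-opened (or its contents cloned) per layer.
    let db = DB::open_default(Path::new("/tmp/repo/staged"))?;
    let dirs = vec!["images".to_string(), "annotations".to_string()];
    let nodes = build_dir_nodes(&db, &dirs)?;
    println!("built {} directory nodes", nodes.len());
    Ok(())
}
```

Nothing about the core algorithm changed; the win came entirely from eliminating the repeated opens and copies around it.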
This experience reinforced a principle I often reflect on, particularly a quote from Fabien that resonates deeply in performance engineering: "In performance-sensitive applications, less is more and simple is better." In our case, this was absolutely true.
The irony wasn't lost on me. We had architected our system with excellent software engineering principles: clear separation of concerns, well-defined interfaces between layers, and modular design. But taken to an extreme in a performance-critical context, those same principles meant each thread and async task re-fetched the same context over and over. That contention on a single resource bottlenecked the entire process.
This became a personal lesson in thinking holistically about system design, even when implementing seemingly isolated features. Good software architecture isn't just about clean separation—it's about understanding how those separations impact performance at scale.
As an interesting aside, this experience also highlighted some limitations of our technology choices. RocksDB, while excellent for many use cases, isn't ideally suited for our "parallel reads" scenario. It's optimized for parallel writes, which explains the unexpected overhead we encountered. Sometimes the tools we choose for one set of requirements can create surprising challenges when our usage patterns evolve.
In the end, this was a powerful reminder that performance engineering often requires looking beyond the obvious algorithmic complexity. Sometimes the biggest gains come not from optimizing the core algorithm, but from rethinking how data flows through your system and eliminating unnecessary work at the architectural level.

