Building an Append-Only Event Store: What Interns Taught Me About Distributed Systems Fundamentals
#Infrastructure

Backend Reporter

Two intern projects revealed why append-only logs, crash recovery, and careful API design matter more than most engineers realize. The lessons apply far beyond the internship.

Every distributed system eventually boils down to two questions: where does the data live, and what happens when things break? I watched two interns grapple with these questions recently, and their solutions illuminated patterns that production systems handle the same way, just with more budget.

The first task was building an append-only event store from scratch. No databases, no frameworks that abstract away the hard parts. The second was a team management API with invitations, soft deletes, and batch operations. Both projects forced confrontation with problems that senior engineers often forget they solved years ago.

The Append-Only Event Store: Why Logs Are the Source of Truth

The specification was deceptively simple: accept events via HTTP, store them in an append-only log file, and read them back by ID. On restart, rebuild the index from the file. No data loss allowed.

This is the same pattern production databases rely on for durability. PostgreSQL has its Write-Ahead Log. Redis offers an append-only file (AOF) as a persistence mode. Kafka is built around a distributed, partitioned log. The pattern exists because it solves a fundamental problem: crash recovery without corruption.

The key insight is that you never edit existing data. You append. If power fails mid-write, the worst case is losing that single incomplete line, provided earlier writes were already flushed to disk. Everything before it remains intact. The trade-off: you give up updating in place, and in exchange you get fast sequential writes and a system where recovery is straightforward.

The intern's implementation used a JavaScript Map as the index, mapping event IDs to file offsets. On startup, it replayed the log to rebuild this index. This is textbook log-structured storage.
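A minimal sketch of that design follows. It is not the intern's actual code: the class and method names (EventStore, append, get, rebuildIndex) are hypothetical, and it assumes one JSON object per line with an id field. It also stores a byte length next to each offset, which the original index may not have done.

```javascript
const fs = require('node:fs');

// Hypothetical names throughout; a sketch, not the original implementation.
class EventStore {
  constructor(path) {
    this.path = path;
    this.index = new Map(); // event ID -> { offset, length }, both in bytes
    this.size = 0;          // byte offset where the next record will start
    this.rebuildIndex();
  }

  // Replay the log from the beginning. The index is derived state only.
  rebuildIndex() {
    this.index.clear();
    this.size = 0;
    if (!fs.existsSync(this.path)) return;
    const data = fs.readFileSync(this.path);
    let start = 0;
    while (start < data.length) {
      const end = data.indexOf(0x0a, start); // next newline byte
      if (end === -1) break;                 // torn final record: no newline
      try {
        const event = JSON.parse(data.toString('utf8', start, end));
        this.index.set(event.id, { offset: start, length: end - start });
      } catch {
        // Unparseable line: skip it rather than refuse to start.
      }
      start = end + 1;
    }
    // Drop a torn tail so the next append begins on a clean boundary.
    if (start < data.length) fs.truncateSync(this.path, start);
    this.size = start;
  }

  append(event) {
    const line = Buffer.from(JSON.stringify(event) + '\n', 'utf8');
    const fd = fs.openSync(this.path, 'a');
    try {
      fs.writeSync(fd, line);
      fs.fsyncSync(fd); // ask the OS to flush before acknowledging the write
    } finally {
      fs.closeSync(fd);
    }
    // Offsets are tracked in bytes (Buffer length), never in string length.
    this.index.set(event.id, { offset: this.size, length: line.length - 1 });
    this.size += line.length;
  }

  get(id) {
    const entry = this.index.get(id);
    if (!entry) return undefined;
    const buf = Buffer.alloc(entry.length);
    const fd = fs.openSync(this.path, 'r');
    try {
      fs.readSync(fd, buf, 0, entry.length, entry.offset);
    } finally {
      fs.closeSync(fd);
    }
    return JSON.parse(buf.toString('utf8'));
  }
}
```

Constructing a new EventStore on the same file is the restart path: the Map starts empty and is refilled purely from what is on disk. Opening and syncing the file on every append keeps the sketch short; a real server would more likely hold one file descriptor open.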

The Unicode Bug: Bytes vs. Characters

The hardest bug to find was a classic byte-length versus character-length mismatch. An emoji like "🔥" looks like one character, but it is four bytes in UTF-8 and two UTF-16 code units in a JavaScript string, so string.length reports 2. The index stored offsets calculated using string.length, but file operations work in bytes. When the server restarted and tried to parse the log, every offset after an emoji-containing event was off by two bytes per emoji.

The fix was using Buffer.byteLength(line, 'utf8') instead of .length. This distinction matters in any system that stores strings and tracks positions. Databases handle this correctly because they have to. Building it from scratch forces you to understand why.
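The mismatch is easy to reproduce in isolation. The snippet below is an illustrative sketch with made-up event lines and a hypothetical startOffsets helper; it computes line offsets both ways and checks them against the bytes that would actually land in the file.

```javascript
// Measure a line the way JavaScript strings do (UTF-16 code units).
const byCodeUnits = (line) => line.length;
// Measure a line the way the file system does (UTF-8 bytes).
const byBytes = (line) => Buffer.byteLength(line, 'utf8');

// Compute the starting offset of each line under a given measure.
function startOffsets(lines, measure) {
  const offsets = [];
  let position = 0;
  for (const line of lines) {
    offsets.push(position);
    position += measure(line);
  }
  return offsets;
}

// Made-up sample events; the first one contains an emoji.
const lines = ['{"id":1,"msg":"deploy 🔥"}\n', '{"id":2,"msg":"rollback"}\n'];
const file = Buffer.from(lines.join(''), 'utf8');

const wrong = startOffsets(lines, byCodeUnits);
const right = startOffsets(lines, byBytes);

console.log('🔥'.length);                     // 2
console.log(Buffer.byteLength('🔥', 'utf8')); // 4
console.log(right[1] - wrong[1]);             // 2
console.log(file.toString('utf8', right[1]).startsWith('{"id":2')); // true
```

Reading from wrong[1] instead of right[1] starts two bytes early, in the middle of the first record, which is why parsing failed only for data written after an emoji.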

The Log vs. Index Distinction

One takeaway deserves emphasis: the source of truth is never the index. The index is a cache. If it disappears, you rebuild it from the log. This principle scales up. In distributed systems, the commit log is authoritative. Replicas are derived views. Snapshots are optimizations. When in doubt, trust the log.

This changes how you think about caching at every level. A cache that can be reconstructed from authoritative data is safe to lose. A cache that is the only copy of data is a liability.

The Teams Management API: Consistency in Batch Operations

The second project involved building an admin teams management system: CRUD operations on teams, member invitations with email delivery, secure token generation, and soft deletes. Three database tables, six API endpoints, and email integration.

Token Security: Hashing Done Right

The invitation system required cryptographic tokens. The pattern is standard: generate a random token, store its SHA-256 hash in the database, send the raw token in the email link. When the user clicks, hash the incoming token and compare.

The intern stored the raw token directly at first. This is a security flaw: if the database leaks, attackers hold valid invitation tokens. With the hashed approach, a leak exposes only hash values, and for a long random token those cannot feasibly be reversed into something usable.
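The generate, hash, and compare steps can be sketched with Node's built-in crypto module. The function names and the 32-byte token size are assumptions for illustration, not details from the project.

```javascript
const crypto = require('node:crypto');

// SHA-256 of the raw token, hex-encoded. This is the only value stored.
function hashToken(rawToken) {
  return crypto.createHash('sha256').update(rawToken, 'utf8').digest('hex');
}

// Hypothetical helper: 32 random bytes is an assumed size, not a requirement.
function createInvitationToken() {
  const rawToken = crypto.randomBytes(32).toString('hex'); // goes in the email link only
  return { rawToken, tokenHash: hashToken(rawToken) };     // tokenHash goes in the database
}

// Hash the incoming token and compare it with the stored hash.
function verifyInvitationToken(incomingToken, storedHash) {
  const incoming = Buffer.from(hashToken(incomingToken), 'hex');
  const stored = Buffer.from(storedHash, 'hex');
  // timingSafeEqual throws on length mismatch, so check lengths first.
  return incoming.length === stored.length && crypto.timingSafeEqual(incoming, stored);
}
```

In a real API the lookup would more likely be a query on the hash column than an in-memory comparison; the comparison is shown here to keep the sketch self-contained.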

This is a common mistake. The trade-off is between simplicity (store what you generate) and security (store only what cannot be misused). In production systems, you assume the database will eventually be exposed. Design accordingly.

Batch Processing and Race Conditions

Inviting multiple emails per request introduced concurrency issues. Each email needed validation against existing pending invites and team memberships. But these checks were not atomic. Two simultaneous requests could both pass the uniqueness check and create duplicate invites.

The solution combined application-level checks with database unique constraints as a backstop. This is a practical approach to consistency: try to prevent violations in your code, but use the database as a safety net. Making the check and the insert fully atomic in application code is expensive. Application checks catch the common case and produce friendly errors; the unique constraint rejects the duplicates that slip through the race window.
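The check-then-insert race and the constraint backstop can be illustrated without a real database. In this sketch, InviteTable is a stand-in for a table with a unique constraint on team and email, and the UNIQUE_VIOLATION error code is invented; real drivers report constraint violations with their own codes.

```javascript
// In-memory stand-in for a table with UNIQUE (team_id, email).
class InviteTable {
  constructor() {
    this.rows = new Map();
  }

  async exists(teamId, email) {
    return this.rows.has(`${teamId}:${email}`);
  }

  async insert(teamId, email) {
    const key = `${teamId}:${email}`;
    if (this.rows.has(key)) {
      const err = new Error('duplicate invite');
      err.code = 'UNIQUE_VIOLATION'; // hypothetical code for this sketch
      throw err;
    }
    this.rows.set(key, { teamId, email, status: 'pending' });
  }
}

async function invite(table, teamId, rawEmail) {
  const email = rawEmail.trim().toLowerCase();

  // 1. Application-level check: catches the common case with a clear result.
  if (await table.exists(teamId, email)) {
    return { email, status: 'already_invited' };
  }

  // 2. Race window: another request may insert the same row right here.
  try {
    await table.insert(teamId, email);
    return { email, status: 'invited' };
  } catch (err) {
    // 3. Backstop: the constraint rejected a duplicate that passed step 1.
    if (err.code === 'UNIQUE_VIOLATION') {
      return { email, status: 'already_invited' };
    }
    throw err;
  }
}
```

Running two invite calls for the same address through Promise.all lets both pass the existence check before either inserts, yet the table ends up with a single row because the second insert is rejected.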

Environment Variable Failures

The invitation links pointed to wrong URLs because APP_URL and FRONTEND_URL were different environment variables, and the codebase used the wrong one. This is a configuration consistency problem. In distributed systems, configuration drift causes outages. The lesson: validate that required environment variables exist and match expectations before using them. Fail fast with clear error messages.
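A startup check along these lines might look like the sketch below. APP_URL and FRONTEND_URL come from the project; the function names, the /invitations/accept path, and the choice of FRONTEND_URL as the base for invitation links are assumptions made for illustration.

```javascript
// Validate required URL settings once, at startup, and report every problem.
function loadUrlConfig(env = process.env) {
  const required = ['APP_URL', 'FRONTEND_URL'];
  const problems = [];
  const config = {};

  for (const name of required) {
    const value = env[name];
    if (!value) {
      problems.push(`${name} is not set`);
      continue;
    }
    try {
      const url = new URL(value);
      if (url.protocol !== 'http:' && url.protocol !== 'https:') {
        problems.push(`${name} must be an http(s) URL, got "${value}"`);
      }
      config[name] = url;
    } catch {
      problems.push(`${name} is not a valid URL: "${value}"`);
    }
  }

  if (problems.length > 0) {
    throw new Error(`Invalid configuration:\n- ${problems.join('\n- ')}`);
  }
  return config;
}

// Build links in one place so the choice of base URL is made exactly once.
// Assumption: invitation links should point at the frontend.
function invitationLink(config, rawToken) {
  const link = new URL('/invitations/accept', config.FRONTEND_URL); // hypothetical path
  link.searchParams.set('token', rawToken);
  return link.toString();
}
```

Calling loadUrlConfig() before the server starts listening turns a wrong link in a sent email into an immediate, readable startup error.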

Patterns That Scale Down and Up

Both projects demonstrated patterns that appear at every scale:

Append-only logs solve durability. Whether you are building a single-file event store or a distributed streaming platform, the principle is the same. Never edit. Append. Rebuild derived state from the log.

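Hashing before storing secrets applies to tokens, API keys, passwords, and any other credential. The hash is not the data. It is a comparison value. This distinction protects you when (not if) storage is compromised. For low-entropy secrets such as passwords, use a slow, salted password-hashing function rather than a single fast hash.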

Batch operations need failure isolation. Processing twenty emails should not fail entirely because two are invalid. Return partial results. Track which items succeeded and which failed. This is the difference between a system that degrades gracefully and one that breaks completely.

Configuration should be validated at startup. Missing or mismatched environment variables cause runtime failures that are hard to diagnose. Check early. Fail loudly. Provide actionable error messages.
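The failure-isolation point about batch operations can be sketched with Promise.allSettled, which waits for every item instead of rejecting on the first error. The inviteBatch name, the sendInvite callback, and the deliberately loose email pattern are all hypothetical.

```javascript
// Deliberately loose pattern: something@something.something, no whitespace.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;

// Process every email independently and report per-item outcomes.
// sendInvite is a caller-supplied async function (hypothetical).
async function inviteBatch(emails, sendInvite) {
  const settled = await Promise.allSettled(
    emails.map(async (email) => {
      if (!EMAIL_PATTERN.test(email)) {
        throw new Error('invalid email address');
      }
      await sendInvite(email);
      return email;
    })
  );

  const succeeded = [];
  const failed = [];
  settled.forEach((result, i) => {
    if (result.status === 'fulfilled') {
      succeeded.push(result.value);
    } else {
      const reason = result.reason instanceof Error ? result.reason.message : String(result.reason);
      failed.push({ email: emails[i], reason });
    }
  });
  return { succeeded, failed };
}
```

An endpoint built on this could return both arrays in one response, so the caller can retry only the failed addresses.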

The Internship as a Systems Engineering Bootcamp

The HNG Internship throws people into these problems without boilerplate or copy-paste solutions. The struggle is intentional. Building a key-value store from files teaches more about databases than any tutorial. Implementing batch invitations with email delivery teaches more about API design than any course.

The panic at 2 AM when a race condition surfaces or a Unicode bug corrupts data is the actual learning. Production systems hide these problems behind layers of abstraction. Building from scratch forces confrontation with the fundamentals.

Both projects broke in different ways, and both left the engineers stronger. The append-only event store taught that logs are truth and indexes are caches. The teams API taught that batch operations, security, and configuration consistency are not optional.

These are the same lessons distributed systems engineers learn at scale, just with more budget and more consequences. The internship compressed that learning into individual projects with tight deadlines. That is the value.
