What Two Backend Projects Teach About Durability, Concurrency, and API Boundaries
#Backend

Backend Reporter

The useful lessons in backend engineering usually arrive when a clean design meets byte offsets, concurrent writes, and data that refuses to behave.

Problem

The HNG14 internship reflection describes two backend projects that look modest from the outside but expose work that databases, queues, and mature platforms usually hide. The first is Eventrail, an append-only event store built on a plain log file. The second is SkillBridge CredLane's employer assessment system, a multi-table feature with scoring, invites, uploads, and concurrency constraints.

Both projects sit in the same family of engineering problems. They ask what happens when state must survive failure, when two requests arrive at the same time, and when an API must protect its own invariants instead of trusting the client. That is where backend work stops being route handlers plus ORM calls and starts becoming systems design.

Eventrail removes the database entirely. Every event is serialized as JSON and appended to events.log as newline-delimited data. Reads use an in-memory index from event ID to byte offset and length. On restart, the service reconstructs that index from the log. This is a small version of the idea behind write-ahead logging, event sourcing, and storage engines. The log is the source of truth. The index is disposable acceleration.

The assessment system has a different shape. It uses the database heavily, but the hard parts are still about state transitions. An employer can create only a limited number of active assessments. A candidate should submit only once. Correct answers must remain server-side. Public links must expose questions without leaking scoring data. CSV and XLSX imports must accept real-world files without crashing the service. Each requirement sounds local, but together they create a distributed-systems problem inside a normal web application.

That is the lesson: most production failures are not exotic. They come from an assumption that held during a single request and failed under concurrency, crash recovery, encoding, partial writes, or hostile input.

Solution Approach

Eventrail starts with the simplest reliable storage model: append, never mutate. A POST /events request receives arbitrary JSON, assigns an ID and timestamp, serializes the event, and writes one line to the end of the file. GET /events/:id reads one event back. GET /stats reports totals.
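The append path can be sketched in Node.js as a function that builds the stored line. This is a minimal sketch, not Eventrail's code: the field names id, timestamp, and payload and the use of crypto.randomUUID for IDs are assumptions, and the payload must be JSON-serializable.

```javascript
const { randomUUID } = require('node:crypto');

// Build the stored form of one event. The server assigns the ID and timestamp,
// so a client can send any JSON body but cannot choose either value.
function createEventRecord(payload) {
  const event = {
    id: randomUUID(),
    timestamp: new Date().toISOString(),
    payload,
  };
  // JSON.stringify without an indent argument emits no structural newlines and
  // escapes newlines inside strings as "\n", so each event is exactly one line.
  const line = JSON.stringify(event) + '\n';
  return { event, line };
}

const { event, line } = createEventRecord({ type: 'signup', note: 'two\nlines' });
console.log(event.id, line.trimEnd());
```

Keeping one event per physical line is what lets recovery split the file on newline bytes without parsing ambiguity.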

The implementation detail that matters is the index. Scanning the file on every read works at ten events and degrades quietly as the log grows. A map from ID to { offset, length } turns the read path from an O(n) scan into an O(1) map lookup followed by a single positioned read. Startup recovery then becomes deterministic: read the log line by line, parse each event, calculate its byte range, rebuild the map, and treat the file as truth.
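A recovery pass along these lines might look like the following sketch. It scans raw bytes for the newline byte 0x0A instead of splitting a decoded string, so every offset is a byte position. Storing length without the trailing newline, and skipping a torn final line, are assumptions made for illustration.

```javascript
// Rebuild the ID -> { offset, length } map from the raw log bytes.
function rebuildIndex(buf) {
  const index = new Map();
  let start = 0;
  while (start < buf.length) {
    const end = buf.indexOf(0x0a, start); // next newline byte
    if (end === -1) break; // torn final line (e.g. crash mid-write): ignore it
    const length = end - start; // bytes of JSON, excluding the newline
    if (length > 0) {
      const event = JSON.parse(buf.toString('utf8', start, end));
      index.set(event.id, { offset: start, length });
    }
    start = end + 1;
  }
  return index;
}

// Direct access: decode only the byte range the index points at.
function readEvent(buf, index, id) {
  const entry = index.get(id);
  if (!entry) return undefined;
  return JSON.parse(buf.toString('utf8', entry.offset, entry.offset + entry.length));
}
```

In a service, the buffer could come from fs.readFileSync('events.log') at startup; a log too large to hold in memory would need a streaming reader instead. A production recovery path might also truncate the file back to the last complete line before accepting new appends, so a torn write is not glued onto the next event.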

This design mirrors a common storage pattern. Durable storage and query acceleration are separate concerns. The append-only log preserves facts. The in-memory index makes reads cheap. If the process dies, the index can be rebuilt. If the index is corrupt, the log can repair it. Systems such as Kafka, PostgreSQL write-ahead logging, and event-sourced applications all rely on versions of this separation.

The first serious bug was a byte-accounting bug. JavaScript's string .length counts UTF-16 code units, not the UTF-8 bytes written to disk. ASCII payloads masked the mistake; any non-ASCII character exposed it. An accented letter, a CJK character, or an emoji made offsets drift, so reads returned broken JSON. The fix was to calculate positions with Buffer.byteLength(line, 'utf8') rather than .length.
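The unit mismatch is easy to see directly. The sketch below compares .length with Buffer.byteLength for a few strings, then shows the buggy and fixed offset calculations side by side; the helper names are hypothetical.

```javascript
// .length counts UTF-16 code units; Buffer.byteLength counts encoded bytes.
const samples = {
  ascii: 'hello',
  accented: 'h\u00e9llo', // "héllo" with a precomposed é
  cjk: '\u65e5\u672c', // "日本"
  emoji: '\u{1F600}', // one emoji, a surrogate pair in UTF-16
};
for (const [name, s] of Object.entries(samples)) {
  console.log(name, s.length, Buffer.byteLength(s, 'utf8'));
}
// ascii 5 5, accented 5 6, cjk 2 6, emoji 2 4

// Hypothetical helpers: where the next event starts after appending `line`.
function nextOffsetBuggy(offset, line) {
  return offset + line.length; // counts code units, drifts on non-ASCII input
}
function nextOffset(offset, line) {
  return offset + Buffer.byteLength(line, 'utf8'); // counts bytes actually written
}
```

Every non-ASCII character widens the gap between the two functions, which is why ASCII-only tests never caught the drift.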

That is not a cosmetic distinction. Storage engines live in bytes. APIs live in strings. Any system that crosses that boundary has to pick the correct unit at the correct layer. A character can occupy multiple bytes. A byte offset into a file cannot be computed from a character count unless the encoding makes that safe, and UTF-8 does not.

The second bug was a write race. Two concurrent requests could observe the same current file position, both compute the same offset, then write different events. The file would contain both events, but the index could point one ID at the wrong byte range. The fix was a promise-based write queue that serializes append operations inside the process.
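A promise-chain queue in the spirit of the fix might look like this. It is a sketch under assumptions: the class name EventLog, tracking the file size in memory, and opening the file on each read are illustrative choices. The property that matters is that offset calculation, append, and index update run inside one queued task, so no two appends can observe the same size.

```javascript
const fs = require('node:fs/promises');

class EventLog {
  constructor(filePath) {
    this.filePath = filePath;
    this.size = 0; // bytes already in the file (assumes an empty file or a prior rebuild)
    this.index = new Map(); // id -> { offset, length }
    this.tail = Promise.resolve(); // completion of the last queued append
  }

  append(event) {
    const run = async () => {
      const json = JSON.stringify(event);
      const offset = this.size; // read the offset only inside the queued task
      const length = Buffer.byteLength(json, 'utf8');
      await fs.appendFile(this.filePath, json + '\n', 'utf8');
      this.size += length + 1; // +1 for the newline byte
      this.index.set(event.id, { offset, length });
      return { offset, length };
    };
    const result = this.tail.then(run);
    // Keep the chain alive even if this append fails.
    this.tail = result.catch(() => {});
    return result;
  }

  async read(id) {
    const entry = this.index.get(id);
    if (!entry) return undefined;
    const handle = await fs.open(this.filePath, 'r');
    try {
      const buf = Buffer.alloc(entry.length);
      await handle.read(buf, 0, entry.length, entry.offset); // positioned read
      return JSON.parse(buf.toString('utf8'));
    } finally {
      await handle.close();
    }
  }
}
```

The queue only orders work inside one Node.js process, which matches the single-process scope of this design. If appendFile fails partway through, this sketch leaves size unchanged even though some bytes may have reached the file; a more careful version would re-stat the file or stop accepting writes.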

That queue is not fancy, but it matches the scope of the system. A single process writing to a single file needs mutual exclusion around offset calculation, file append, and index update. Once multiple processes can write to the same log, the design needs an operating-system file lock, a single writer service, or a real database. Correctness depends on where concurrency is allowed to enter.

The SkillBridge CredLane assessment system starts from relational modeling instead. It introduces tables for assessments, questions, invites, and submissions. The schema encodes invariants with checks and constraints: allowed time limits, valid passing thresholds, valid score ranges, foreign keys, and uniqueness where duplicate business events must be impossible.
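To make the constraints concrete, here is a hypothetical migration kept as a SQL string in a Node.js module. The table names follow the article, but every column name, the numeric bounds, and the assumed employers and users tables are illustrative assumptions rather than the project's schema.

```javascript
// Hypothetical migration; bounds and column names are illustrative only.
const MIGRATION_SQL = `
CREATE TABLE assessments (
  id                 BIGSERIAL PRIMARY KEY,
  employer_id        BIGINT  NOT NULL REFERENCES employers (id),
  title              TEXT    NOT NULL,
  status             TEXT    NOT NULL CHECK (status IN ('draft', 'active', 'archived')),
  time_limit_minutes INTEGER NOT NULL CHECK (time_limit_minutes BETWEEN 5 AND 180),
  passing_score      INTEGER NOT NULL CHECK (passing_score BETWEEN 0 AND 100)
);

CREATE TABLE questions (
  id             BIGSERIAL PRIMARY KEY,
  assessment_id  BIGINT NOT NULL REFERENCES assessments (id) ON DELETE CASCADE,
  prompt         TEXT   NOT NULL,
  options        JSONB  NOT NULL,
  correct_answer TEXT   NOT NULL -- never returned by public endpoints
);

CREATE TABLE invites (
  id            BIGSERIAL PRIMARY KEY,
  assessment_id BIGINT NOT NULL REFERENCES assessments (id) ON DELETE CASCADE,
  email         TEXT   NOT NULL,
  token         TEXT   NOT NULL UNIQUE,
  UNIQUE (assessment_id, email)
);

CREATE TABLE submissions (
  id                BIGSERIAL PRIMARY KEY,
  assessment_id     BIGINT        NOT NULL REFERENCES assessments (id),
  candidate_user_id BIGINT        NOT NULL REFERENCES users (id),
  score             NUMERIC(5, 2) NOT NULL CHECK (score BETWEEN 0 AND 100),
  passed            BOOLEAN       NOT NULL,
  submitted_at      TIMESTAMPTZ   NOT NULL DEFAULT now(),
  -- One submission per candidate per assessment, enforced even under races.
  CONSTRAINT submissions_one_per_candidate UNIQUE (assessment_id, candidate_user_id)
);
`;
```

The CHECK clauses reject out-of-range time limits, thresholds, and scores on every write path, whichever service or script performs the write.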

That database-first posture matters. Application validation is useful for clear errors and early rejection, but it is not the final authority. If two application instances race, or a future code path forgets a check, the database is the component still capable of refusing invalid state. PostgreSQL constraints are not paperwork. They are part of the concurrency design.

Assessment creation is a classic example. The business rule says an employer can have at most three active assessments. The naive code path is count then insert. Count active assessments. If the count is below three, create a new one. That works until two requests run at the same time. Both see count two. Both insert. The employer now has four active assessments.
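The race is reproducible without a database. In this simulation the store, its method names, and the 5 ms sleeps are all hypothetical; the sleeps stand in for database round trips, which is where a second request gets to run between the count and the insert.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
const MAX_ACTIVE = 3;

// Hypothetical in-memory stand-in for the assessments table.
function makeStore(seedActive) {
  const rows = Array.from({ length: seedActive }, () => ({ employerId: 1, status: 'active' }));
  return {
    rows,
    async countActive(employerId) {
      await sleep(5); // simulated round trip
      return rows.filter((r) => r.employerId === employerId && r.status === 'active').length;
    },
    async insertActive(employerId) {
      await sleep(5); // simulated round trip
      rows.push({ employerId, status: 'active' });
    },
  };
}

// The naive count-then-insert path.
async function naiveCreate(store, employerId) {
  const active = await store.countActive(employerId);
  if (active >= MAX_ACTIVE) throw new Error('active assessment limit reached');
  await store.insertActive(employerId);
}

async function demo() {
  const store = makeStore(2); // the employer already has two active assessments
  await Promise.all([naiveCreate(store, 1), naiveCreate(store, 1)]);
  return store.rows.length; // both requests counted 2, so both inserted
}

demo().then((n) => console.log('active assessments:', n)); // prints "active assessments: 4"
```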

The fix uses a transaction and pessimistic locking. Before counting, the service locks the employer row with SELECT ... FOR UPDATE. The second request blocks on that lock until the first transaction commits. Under PostgreSQL's default READ COMMITTED isolation level, its count then runs as a new statement with a fresh snapshot, so it sees the newly committed assessment. PostgreSQL documents this behavior in its explicit locking guide, and it is a practical tool when a rule is tied to a shared aggregate rather than a single inserted row.
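A sketch of the locked create path, assuming a node-postgres-style client whose query(text, values) resolves to { rows, rowCount }. The client must be a single connection (for example one obtained from pool.connect()), because BEGIN, the lock, and COMMIT have to run on the same session. Column names and the ActiveLimitError class are hypothetical.

```javascript
const MAX_ACTIVE = 3;

class ActiveLimitError extends Error {}

async function createAssessment(client, employerId, { title, timeLimitMinutes, passingScore }) {
  await client.query('BEGIN');
  try {
    // Lock the shared aggregate. Another transaction running this statement for
    // the same employer blocks here until we COMMIT or ROLLBACK.
    const lock = await client.query('SELECT id FROM employers WHERE id = $1 FOR UPDATE', [employerId]);
    if (lock.rowCount === 0) throw new Error(`employer ${employerId} not found`);

    // Under READ COMMITTED this statement gets a fresh snapshot, so it sees any
    // assessment committed by the transaction that held the lock before us.
    const { rows } = await client.query(
      "SELECT count(*)::int AS active FROM assessments WHERE employer_id = $1 AND status = 'active'",
      [employerId],
    );
    if (rows[0].active >= MAX_ACTIVE) {
      throw new ActiveLimitError(`employer ${employerId} already has ${MAX_ACTIVE} active assessments`);
    }

    const inserted = await client.query(
      "INSERT INTO assessments (employer_id, title, status, time_limit_minutes, passing_score) VALUES ($1, $2, 'active', $3, $4) RETURNING id",
      [employerId, title, timeLimitMinutes, passingScore],
    );
    await client.query('COMMIT');
    return inserted.rows[0].id;
  } catch (err) {
    await client.query('ROLLBACK'); // also releases the employer row lock
    throw err;
  }
}
```

The lock is held until COMMIT or ROLLBACK, so the critical section is the whole transaction, not just the SELECT. The sketch relies on READ COMMITTED; other isolation levels change how the waiting transaction sees the first one's insert.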

Duplicate submissions use a related but different pattern. A candidate should not submit the same assessment twice. The application can check for an existing submission, but that check has the same race as count then insert. The correct backstop is a unique database constraint on (assessment_id, candidate_user_id). The service can still perform the check for a clean user-facing error, but the database constraint wins the race. If PostgreSQL raises SQLSTATE 23505 (unique_violation), the API translates it into a 409 Conflict response.
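The constraint-backed insert can be sketched as follows, again assuming a node-postgres-style client; node-postgres exposes the SQLSTATE as err.code. The table and column names and the ConflictError class are hypothetical. A friendly pre-check can still run first, but only the constraint decides the race.

```javascript
class ConflictError extends Error {
  constructor(message) {
    super(message);
    this.status = 409; // surfaced to the client as 409 Conflict
  }
}

async function recordSubmission(client, { assessmentId, candidateUserId, score, passed }) {
  try {
    const { rows } = await client.query(
      'INSERT INTO submissions (assessment_id, candidate_user_id, score, passed) VALUES ($1, $2, $3, $4) RETURNING id',
      [assessmentId, candidateUserId, score, passed],
    );
    return rows[0].id;
  } catch (err) {
    // 23505 = unique_violation: a concurrent or earlier submission already exists.
    if (err.code === '23505') {
      throw new ConflictError('This candidate has already submitted this assessment.');
    }
    throw err; // anything else is not a duplicate and must not be disguised as one
  }
}
```

An alternative is INSERT ... ON CONFLICT (assessment_id, candidate_user_id) DO NOTHING RETURNING id, which returns no row instead of raising an error when the submission already exists.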

This is the right division of labor. The application handles intent, authorization, response shape, and friendly errors. The database handles non-negotiable facts.

Scoring follows the same principle. The client submits selected answers, not scores. The service compares submitted answers against stored correct answers, normalizes strings, computes the percentage, and stores whether the candidate passed. The public assessment endpoint strips correct answers before returning questions. That keeps the API boundary honest: clients may display and collect, but they do not decide truth.
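A minimal scoring sketch follows. The question shape, the normalization rules (Unicode NFC, trimming, collapsing whitespace, lowercasing), and rounding to two decimals are assumptions. The client sends only answers, and the public shape is built from an allow-list of fields rather than by deleting correctAnswer.

```javascript
// Assumed normalization rules for comparing free-text answers.
function normalize(answer) {
  return String(answer).normalize('NFC').trim().replace(/\s+/g, ' ').toLowerCase();
}

// `questions` come from the database with correctAnswer; `submitted` is the
// client's { [questionId]: answer } map. The client never sends a score.
function scoreSubmission(questions, submitted, passingScore) {
  let correct = 0;
  for (const q of questions) {
    const answered = Object.prototype.hasOwnProperty.call(submitted, q.id);
    if (answered && normalize(submitted[q.id]) === normalize(q.correctAnswer)) correct += 1;
  }
  const total = questions.length;
  // Percentage rounded to two decimals, e.g. 2 of 3 correct gives 66.67.
  const percentage = total === 0 ? 0 : Math.round((correct / total) * 10000) / 100;
  return { correct, total, percentage, passed: percentage >= passingScore };
}

// Public shape: copy an allow-list of fields, so a newly added secret column
// stays hidden by default instead of leaking by default.
function toPublicQuestion({ id, prompt, options }) {
  return { id, prompt, options };
}
```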

The file import path adds a different kind of failure mode. CSV parsing has to handle quoted fields, embedded commas, and escaped quotes. XLSX is more complex because an .xlsx file is a ZIP archive containing XML documents. A minimal parser has to locate the central directory, inflate compressed entries, resolve workbook relationships, parse shared strings, and extract cell values from sheet XML. The Office Open XML format is documented, but even a narrow implementation forces the service to treat uploaded files as structured binary input rather than friendly text.
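The CSV half of the problem fits in a small state machine. This sketch follows RFC 4180 conventions (a field wrapped in double quotes may contain commas and line breaks, and a doubled "" inside it is a literal quote) and is an illustration, not the project's parser; skipping a leading byte-order mark and rejecting an unterminated quote are assumed policies.

```javascript
function parseCsv(text) {
  const rows = [];
  let row = [];
  let field = '';
  let inQuotes = false;
  // Skip a leading byte-order mark, which spreadsheet exports often include.
  let i = text.charCodeAt(0) === 0xfeff ? 1 : 0;
  for (; i < text.length; i++) {
    const ch = text[i];
    if (inQuotes) {
      if (ch !== '"') {
        field += ch; // commas and line breaks are literal inside quotes
      } else if (text[i + 1] === '"') {
        field += '"'; // "" is an escaped quote
        i++;
      } else {
        inQuotes = false; // closing quote
      }
    } else if (ch === '"' && field === '') {
      inQuotes = true; // opening quote at the start of a field
    } else if (ch === ',') {
      row.push(field);
      field = '';
    } else if (ch === '\n' || ch === '\r') {
      if (ch === '\r' && text[i + 1] === '\n') i++; // CRLF counts as one break
      row.push(field);
      rows.push(row);
      row = [];
      field = '';
    } else {
      field += ch;
    }
  }
  if (inQuotes) throw new Error('unterminated quoted field');
  // Flush the last row when the input does not end with a line break.
  if (field !== '' || row.length > 0) {
    row.push(field);
    rows.push(row);
  }
  return rows;
}
```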

Trade-offs

Eventrail's append-only design is easy to reason about because it narrows the write path. Appending avoids in-place mutation. Recovery is transparent. A corrupted or missing in-memory index can be rebuilt from durable data. The price is that compaction, deletion, and multi-process coordination are not free.

An append-only log grows forever unless the system introduces snapshots, segment files, tombstones, or retention policies. Those features add complexity because they change the simple invariant from “all history is in one file” to “history may be split across active segments, compacted segments, and snapshots.” That is where real log-structured systems earn their complexity.

The in-memory index has similar trade-offs. It makes reads fast, but memory usage grows with event count. Rebuilding on startup is fine for hundreds or thousands of events. At larger sizes, startup time becomes operationally relevant. The next step might be a persisted index, segmented logs, or a database. Each option shifts cost between write latency, startup recovery, memory, and implementation complexity.

The promise-based write queue is also scoped. It provides in-process serialization. It does not solve multiple Node.js workers, multiple containers, or a log file shared over NFS. A production version would need a clear single-writer model or a storage system with well-defined concurrent append semantics. This is a recurring pattern in backend systems: a correct local lock can become incorrect when the deployment topology changes.

The assessment system makes the opposite trade-off. It relies on the database for integrity, which is usually the right call for business workflows. Constraints, transactions, and locks keep state coherent across application instances. The cost is that database contention becomes part of the design.

Pessimistic locking is clear and correct for the active-assessment limit, but it serializes creation per employer. That is probably acceptable because assessment creation is not a high-frequency path. It would be a poor fit for a hot counter updated thousands of times per second. For a hotter workload, the design might use a materialized quota row, an atomic update with a conditional predicate, or a queue. The correct mechanism depends on contention, latency budget, and how expensive a rejected operation is.
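The conditional-update alternative can be sketched as a single statement. The employer_quotas table and reserveActiveSlot function are hypothetical, and the client is assumed to follow node-postgres's query(text, values) shape. Under READ COMMITTED, a second UPDATE on the same row waits for the first to commit and then re-evaluates the WHERE clause against the new row version, so the check and the increment cannot interleave.

```javascript
// Hypothetical table: employer_quotas(employer_id BIGINT PRIMARY KEY, active_count INTEGER NOT NULL).
async function reserveActiveSlot(client, employerId, maxActive = 3) {
  // Check and increment in one statement: the row changes only if it is below the limit.
  const result = await client.query(
    'UPDATE employer_quotas SET active_count = active_count + 1 ' +
      'WHERE employer_id = $1 AND active_count < $2 RETURNING active_count',
    [employerId, maxActive],
  );
  // rowCount 0 means the limit was reached (or the employer has no quota row).
  return result.rowCount === 1;
}
```

The reservation would need to run in the same transaction as the INSERT into assessments, and archiving an assessment would need a matching decrement. That bookkeeping is what this approach adds in exchange for a check that needs no separate count query.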

Unique constraints for duplicate submissions are almost always worth it. They are simple, local, and reliable. The trade-off is API design: the service must translate database errors into stable domain errors rather than leaking raw SQL failures. That translation layer is not optional if clients depend on predictable responses.
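A translation layer can be as small as a lookup table from SQLSTATE codes to domain errors. The codes below are PostgreSQL's integrity-constraint codes (23505 unique_violation, 23503 foreign_key_violation, 23514 check_violation, 23502 not_null_violation); the HTTP statuses, error codes, and messages are assumptions about what clients expect.

```javascript
const PG_ERROR_MAP = new Map([
  ['23505', { status: 409, code: 'CONFLICT', message: 'This record already exists.' }],
  ['23503', { status: 422, code: 'INVALID_REFERENCE', message: 'A referenced record does not exist.' }],
  ['23514', { status: 422, code: 'OUT_OF_RANGE', message: 'A value is outside the allowed range.' }],
  ['23502', { status: 422, code: 'MISSING_FIELD', message: 'A required field is missing.' }],
]);

class DomainError extends Error {
  constructor({ status, code, message }) {
    super(message);
    this.status = status;
    this.code = code;
  }
}

// Translate a driver error; node-postgres puts the SQLSTATE in err.code.
function toDomainError(err) {
  const mapped = err && PG_ERROR_MAP.get(err.code);
  if (mapped) return new DomainError(mapped);
  // Unknown failures become a generic 500; log `err` server-side, never return it.
  return new DomainError({ status: 500, code: 'INTERNAL', message: 'Unexpected server error.' });
}
```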

Server-side scoring also carries product trade-offs. It protects correctness, but it requires careful versioning. If an employer edits a question after candidates have taken an assessment, should old submissions be scored against the old answer or the new one? A mature design would snapshot assessment questions at publish time or make published assessments immutable. Otherwise, the system can accidentally rewrite history.
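One way to avoid rewriting history is to freeze a copy of the questions at publish time and score each submission against the version it was taken under. The in-memory Map, the function names, and the version numbering below are hypothetical; in PostgreSQL the same idea could be a versions table or a snapshot column that submissions reference.

```javascript
// Recursively freeze plain objects and arrays so a published version cannot be mutated.
function deepFreeze(value) {
  if (value !== null && typeof value === 'object') {
    for (const key of Object.keys(value)) deepFreeze(value[key]);
    Object.freeze(value);
  }
  return value;
}

// Hypothetical store: assessmentId -> array of published versions.
const versions = new Map();

function publish(assessmentId, draftQuestions) {
  const history = versions.get(assessmentId) || [];
  const snapshot = {
    version: history.length + 1,
    // A JSON round trip copies plain question data, so later draft edits cannot reach it.
    questions: deepFreeze(JSON.parse(JSON.stringify(draftQuestions))),
  };
  history.push(snapshot);
  versions.set(assessmentId, history);
  return snapshot.version;
}

// Scoring looks up the version stored on the submission, never the live draft.
function questionsFor(assessmentId, version) {
  const found = (versions.get(assessmentId) || []).find((v) => v.version === version);
  if (!found) throw new Error(`assessment ${assessmentId} has no version ${version}`);
  return found.questions;
}
```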

The file parser is the most debatable decision. Avoiding a dependency can reduce package risk and make the implementation easier to audit, but file formats are full of edge cases. A minimal XLSX parser is acceptable when the accepted format is narrow and well tested. It becomes risky if users expect full Excel compatibility: formulas, merged cells, dates, styles, shared string variants, multiple sheets, and unusual compression cases. The pragmatic approach is to define the supported input contract clearly and reject anything outside it with useful errors.
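A supported-input contract can be enforced before any parsing starts. The accepted extensions, the 2 MiB limit, and the error messages below are assumptions. The .xlsx check relies on the fact that a ZIP archive normally begins with a local file header whose signature bytes are 0x50 0x4B 0x03 0x04 ("PK" followed by two control bytes), and the CSV check uses a fatal TextDecoder so invalid UTF-8 is rejected instead of silently replaced.

```javascript
const path = require('node:path');

const MAX_UPLOAD_BYTES = 2 * 1024 * 1024; // assumed limit
const ZIP_LOCAL_HEADER = Buffer.from([0x50, 0x4b, 0x03, 0x04]);

function checkUploadContract(filename, buf) {
  const ext = path.extname(filename).toLowerCase();
  if (ext !== '.csv' && ext !== '.xlsx') {
    return { ok: false, error: `Unsupported file type "${ext || 'none'}". Upload .csv or .xlsx.` };
  }
  if (buf.length === 0) return { ok: false, error: 'The file is empty.' };
  if (buf.length > MAX_UPLOAD_BYTES) {
    return { ok: false, error: `The file is larger than ${MAX_UPLOAD_BYTES} bytes.` };
  }
  if (ext === '.xlsx' && !buf.subarray(0, 4).equals(ZIP_LOCAL_HEADER)) {
    return { ok: false, error: 'This .xlsx file is not a valid ZIP container.' };
  }
  if (ext === '.csv') {
    try {
      new TextDecoder('utf-8', { fatal: true }).decode(buf); // throws on invalid UTF-8
    } catch {
      return { ok: false, error: 'CSV files must be UTF-8 encoded.' };
    }
  }
  return { ok: true, kind: ext.slice(1) };
}
```

Rejecting early with a specific message gives the user something actionable, and it keeps the parsers from ever seeing input outside the contract they were written for.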

The broader API pattern across both projects is defensive ownership. The API accepts intent, not authority. A client can ask to create an event, submit answers, or import questions. The service assigns IDs, computes offsets, scores submissions, enforces limits, and persists facts. That distinction is what keeps a backend from becoming a thin proxy for client assumptions.

The strongest engineering lesson from these projects is that reliability is built from small, boring guarantees that compose. Use bytes for byte offsets. Serialize writes when offset and append must agree. Rebuild derived state from source-of-truth data. Put invariant checks in the database. Treat count then insert as suspicious. Make duplicate business events impossible with unique constraints. Keep scoring and authorization server-side.

Tutorials often teach the happy path because the happy path is where an API first becomes visible. Production systems fail in the spaces between happy paths: restart during write, two requests at once, Unicode in a payload, a malformed spreadsheet, a client that submits twice, or a public endpoint that returns one field too many. These HNG14 projects matter because they forced those spaces into the design instead of leaving them as future cleanup.
