Two Tasks That Changed How I Think About Backend Engineering

A job scheduler and an AI validation pipeline taught the same backend lesson from different angles: correctness depends less on the happy path than on the contracts you verify under failure.

Problem

Backend engineering looks clean in diagrams. Jobs enter a queue, workers pick them up, failed work retries, dashboards stream state, and validation catches bad inputs before they become production data. The diagram is not wrong, but it hides the part that matters most: every boundary lies.

The first task was a production job scheduler built from scratch. It had a heap-based priority queue, worker processes, retry handling with exponential backoff, a dead-letter queue, dependency ordering through DAGs, and an SSE dashboard deployed behind Nginx with HTTPS. On paper, the system was a classic coordination problem: accept work, persist it, claim it once, execute it, and expose enough state that operators can understand what is happening.

The second task was a team project using Zod to validate AI-generated quiz questions. The problem looked different, but it had the same shape. An external producer generated structured data, and the backend needed to decide whether that data was acceptable before trusting it. In this case, the producer was an AI model instead of a user, webhook, queue consumer, or partner API.

Both tasks changed how I think about backend systems because both punished vague contracts. The scheduler failed when I trusted an ORM result shape without proving it. The validation pipeline worked because we treated AI output as hostile until it satisfied explicit rules.

Solution Approach

The scheduler started with the usual API pattern: clients submit jobs, workers claim jobs, and the dashboard subscribes to state changes. The important design decision was to keep job state in a persistent store rather than only in memory. A heap is useful for selecting the next job efficiently, but once multiple workers and deployments enter the picture, the database becomes the source of truth.

A job scheduler is really a small distributed system. Even if it runs on one VM, it has independent actors observing and mutating shared state. The API has to answer questions like these: Can two workers claim the same job? What happens if a worker crashes after claiming but before finishing? Is retry state durable? Are dependency edges checked before execution or only when enqueuing? What does the dashboard show when the database and worker memory disagree?
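
One of those questions, whether dependency edges are checked when a job is enqueued, can be made concrete. The following is a minimal sketch using Kahn's algorithm, not the scheduler's actual code. The DependencyGraph shape and the topologicalOrder name are assumptions made for illustration.

```typescript
// Maps a job id to the ids it depends on. This shape is an assumption for the sketch.
type DependencyGraph = Map<string, string[]>;

// Returns a valid execution order, or null when the graph contains a cycle
// or references a job id that is not present in the graph.
function topologicalOrder(graph: DependencyGraph): string[] | null {
  const remaining = new Map<string, number>(); // unmet dependency count per job
  const dependents = new Map<string, string[]>(); // reverse edges: dependency -> jobs waiting on it

  for (const [job, deps] of graph) {
    remaining.set(job, deps.length);
    for (const dep of deps) {
      if (!graph.has(dep)) return null; // unknown dependency
      const list = dependents.get(dep) ?? [];
      list.push(job);
      dependents.set(dep, list);
    }
  }

  // Start with every job that has no dependencies at all.
  const ready: string[] = [];
  remaining.forEach((count, job) => {
    if (count === 0) ready.push(job);
  });

  const order: string[] = [];
  while (ready.length > 0) {
    const job = ready.shift() as string;
    order.push(job);
    for (const child of dependents.get(job) ?? []) {
      const left = (remaining.get(child) ?? 0) - 1;
      remaining.set(child, left);
      if (left === 0) ready.push(child);
    }
  }

  // Jobs stuck in a cycle never reach zero unmet dependencies.
  return order.length === graph.size ? order : null;
}
```

Running a check like this at enqueue time rejects a cyclic workflow before any of its jobs become runnable, instead of leaving descendants blocked forever.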

The core scheduler used a heap-based priority queue to order runnable work. That gives good local performance: with a typical binary heap, reading the next item is O(1) and inserting or removing one is O(log n). But scalability is not only about big-O notation. A priority queue in process memory scales until it becomes inconsistent with the durable state around it. Once workers are separate processes, or once the scheduler restarts, the heap must be treated as a cache over persisted jobs rather than the final authority.
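
To illustrate "the heap is a cache", here is a small sketch in which the heap is always rebuilt from persisted rows. The JobRow fields, the convention that a lower number means higher priority, and the rebuildHeap helper are all assumptions for the example, not the original implementation.

```typescript
// Hypothetical persisted job row; field names are assumptions for this sketch.
interface JobRow {
  id: string;
  priority: number; // lower number runs first (assumed convention)
  status: "pending" | "claimed" | "done";
}

// Binary min-heap ordered by priority.
class JobHeap {
  private items: JobRow[] = [];

  get size(): number {
    return this.items.length;
  }

  push(job: JobRow): void {
    this.items.push(job);
    let i = this.items.length - 1;
    // Sift up: swap with the parent while the parent has a larger priority value.
    while (i > 0) {
      const parent = (i - 1) >> 1;
      if (this.items[parent].priority <= this.items[i].priority) break;
      [this.items[parent], this.items[i]] = [this.items[i], this.items[parent]];
      i = parent;
    }
  }

  pop(): JobRow | undefined {
    const top = this.items[0];
    const last = this.items.pop();
    if (last !== undefined && this.items.length > 0) {
      this.items[0] = last;
      let i = 0;
      // Sift down: swap with the smaller child until heap order is restored.
      for (;;) {
        const left = 2 * i + 1;
        const right = left + 1;
        let smallest = i;
        if (left < this.items.length && this.items[left].priority < this.items[smallest].priority) smallest = left;
        if (right < this.items.length && this.items[right].priority < this.items[smallest].priority) smallest = right;
        if (smallest === i) break;
        [this.items[smallest], this.items[i]] = [this.items[i], this.items[smallest]];
        i = smallest;
      }
    }
    return top;
  }
}

// The heap is derived state: after a restart it is rebuilt from the durable rows,
// and only rows the database still considers pending are loaded.
function rebuildHeap(persisted: JobRow[]): JobHeap {
  const heap = new JobHeap();
  for (const row of persisted) {
    if (row.status === "pending") heap.push(row);
  }
  return heap;
}
```

Because the heap can be thrown away and rebuilt at any time, a disagreement between memory and the database is resolved in favor of the database.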

The worker claim path was the critical consistency boundary. The intended behavior was simple: find a pending job, mark it as claimed, and let exactly one worker execute it. In practice, “exactly once” execution is usually too expensive or too optimistic. A more realistic goal is at-least-once execution with idempotent handlers, or effectively-once behavior when the job operation can be guarded by idempotency keys and transactional writes.

That distinction matters. If a job sends an email, charges a card, or mutates external state, retrying blindly can duplicate effects. The scheduler should either require handlers to be idempotent or provide an API for idempotency keys, deduplication windows, and completion records. This is where API design and consistency models meet. The API cannot simply say POST /jobs and pretend the rest is implementation detail. It needs to define what duplicate submissions mean, what retry guarantees exist, and whether callers can safely submit the same logical task twice.
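
A minimal sketch of the idempotency-key idea follows. CompletionStore and runOnce are hypothetical names, and the in-memory map stands in for a durable completion table. It is not safe across processes, and a crash between running the handler and recording the result would still allow a duplicate, which is why a real version would tie the record to the side effect transactionally where it can.

```typescript
// In-memory stand-in for a durable completion table (hypothetical).
class CompletionStore<T> {
  private readonly results = new Map<string, T>();

  has(key: string): boolean {
    return this.results.has(key);
  }

  get(key: string): T | undefined {
    return this.results.get(key);
  }

  record(key: string, result: T): void {
    this.results.set(key, result);
  }
}

// Runs the handler at most once per idempotency key within this store.
// A repeated delivery replays the recorded result instead of repeating the effect.
function runOnce<T>(store: CompletionStore<T>, idempotencyKey: string, handler: () => T): T {
  if (store.has(idempotencyKey)) {
    return store.get(idempotencyKey) as T;
  }
  const result = handler();
  store.record(idempotencyKey, result);
  return result;
}
```

With a guard like this, an at-least-once scheduler can deliver the same job twice and the caller still observes one effect, as long as both deliveries carry the same key.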

The worst scheduler bug came from TypeORM and its .query() return shape. The raw query result came back as a tuple, effectively [rows, count]. Checking .length on that value returned 2, even when there were zero rows. A lost claim race looked like a successful claim because the code was measuring the wrapper, not the data inside it.

That is the kind of bug that survives casual testing because the call does return something. There is no crash. There is no obvious exception. The code keeps moving with a false belief about state. In a scheduler, that is dangerous because the claim path is the lock. If the system thinks it claimed work that it did not claim, workers can report progress that never happened, retry logic can become confused, and the dashboard can show a state transition that was never durable.

The fix was not clever. It was a tuple unwrapper that normalized ORM results before the scheduler interpreted them. The larger lesson was more durable: never let infrastructure-specific shapes leak into correctness logic. If the database layer can return different envelopes depending on driver or query type, unwrap it once at the boundary and make the rest of the system consume a stable internal contract.
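
The unwrapper can be sketched in a few lines. This is an illustration rather than the original code: unwrapRows is a hypothetical name, and the [rows, count] tuple shape is taken from the description above. The exact envelope depends on the driver and query type, so the detection rule here is an assumption.

```typescript
// Normalizes a raw query result into a plain array of rows.
// Accepts either a bare row array or a [rows, count] tuple (assumed shapes).
function unwrapRows<T>(raw: unknown): T[] {
  if (!Array.isArray(raw)) {
    // Fail loudly instead of letting an unknown shape reach correctness logic.
    throw new TypeError("unexpected raw query result shape");
  }
  const looksLikeTuple =
    raw.length === 2 && Array.isArray(raw[0]) && typeof raw[1] === "number";
  return (looksLikeTuple ? raw[0] : raw) as T[];
}

// The bug in miniature: a lost claim race returns zero rows,
// but the wrapper itself still has length 2.
const lostRace: unknown[] = [[], 0];
const wrapperLength = lostRace.length; // measures the envelope, not the data
const rowCount = unwrapRows(lostRace).length; // measures the rows
```

Here wrapperLength is 2 and rowCount is 0, which is exactly the difference between "claimed" and "lost the race".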

Deployment added a second failure class. The application configuration looked correct, but traffic still failed because Oracle’s iptables had a REJECT rule above the expected ACCEPT rules. Because iptables evaluates a chain from top to bottom and stops at the first terminating match, the ACCEPT rules below it never applied. That kind of issue is easy to misdiagnose as an Nginx, TLS, DNS, or application problem. The actual failure was lower in the stack. Packets were rejected before the application had a chance to be correct.

That deployment failure reinforced a practical operating rule: production behavior is the composition of application code, runtime configuration, network policy, process supervision, TLS termination, and host firewall state. A backend engineer does not need to become a full-time network administrator, but they do need to be able to follow a request from client to process and prove where it stops.

The AI validation task had a cleaner success story because the contract was explicit from the start. Zod schemas described the expected quiz question shape. Basic fields could be checked with ordinary type and length constraints. The more interesting checks used .refine() to enforce cross-field rules, such as requiring correctAnswer to point to an existing option.

That matters because many malformed AI outputs are structurally plausible. A quiz question can have a string prompt, an array of options, and a numeric answer index while still being invalid. If there are three options and the model returns correctAnswer: 4, the JSON shape is fine, but the domain object is broken.
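
A sketch of what such a schema can look like, using Zod's object, refine, and safeParse. The prompt and options field names, the zero-based index convention, and the sample question are assumptions for illustration. Only correctAnswer is named in the project description.

```typescript
import { z } from "zod";

// Structural checks first, then a cross-field rule via .refine().
const QuizQuestion = z
  .object({
    prompt: z.string().min(1),
    options: z.array(z.string().min(1)).min(2),
    correctAnswer: z.number().int().nonnegative(), // assumed to be a zero-based index
  })
  .refine((q) => q.correctAnswer < q.options.length, {
    message: "correctAnswer must be the index of an existing option",
    path: ["correctAnswer"],
  });

// Made-up sample: structurally valid JSON, but the answer points past the options.
const result = QuizQuestion.safeParse({
  prompt: "Which data structure usually backs a priority queue?",
  options: ["Heap", "Stack", "Trie"],
  correctAnswer: 4,
});

if (!result.success) {
  // Each issue carries a path and a message that can be logged or fed back to the model.
  console.log(result.error.issues.map((i) => `${i.path.join(".")}: ${i.message}`));
}
```

Using safeParse rather than parse keeps the failure as a value, so the caller can decide whether to retry, repair, or reject.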

Feeding Zod errors back into the AI as retry prompts turned validation from a terminal failure into a control loop. The model produced an object, the schema rejected it, the system sent precise validation feedback, and the model tried again. This is a useful API pattern for AI-backed systems: do not ask the model to be correct by convention. Put a validator after it, then make the retry path consume machine-readable error messages.
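
The control loop can be sketched independently of any particular validator or model client. Everything named here (Issue, Validation, generateWithRetries, and the generate and validate callbacks) is hypothetical. In the quiz project the validate step would wrap the Zod schema, and generate would call the model with the feedback appended to the prompt.

```typescript
// Minimal structural types so the sketch does not depend on a specific validator.
interface Issue {
  path: string;
  message: string;
}

type Validation<T> = { ok: true; value: T } | { ok: false; issues: Issue[] };

// Calls the generator, validates its output, and feeds validation errors
// back into the next attempt until the output passes or the cap is reached.
async function generateWithRetries<T>(
  generate: (feedback: string | null) => Promise<unknown>,
  validate: (candidate: unknown) => Validation<T>,
  maxAttempts: number,
): Promise<T> {
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = await generate(feedback);
    const checked = validate(candidate);
    if (checked.ok) return checked.value;
    // Turn machine-readable issues into text the next prompt can include.
    feedback = checked.issues.map((i) => `${i.path}: ${i.message}`).join("\n");
  }
  throw new Error(`no valid output after ${maxAttempts} attempts; last errors:\n${feedback}`);
}
```

The attempt cap matters as much as the feedback: without it, a model that keeps failing the same rule turns a validation error into unbounded latency and cost.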

The pattern generalizes beyond quiz generation. Any AI system that emits JSON, tool calls, database mutations, policy decisions, or UI configuration needs a boundary validator. Zod is one option in TypeScript, with documentation at zod.dev, but the important idea is language-independent. Treat generated output like untrusted external input. Validate structure, validate domain rules, reject or repair failures, and record enough error detail to debug repeated bad outputs.

Trade-Offs

The scheduler’s design made failure visible, but each reliability feature added operational complexity. Retries with exponential backoff reduce pressure on failing dependencies, but they also delay recovery and require careful tuning. A dead-letter queue protects the main queue from poison jobs, but somebody has to inspect, replay, or discard those jobs. DAG dependencies model real workflows, but they also create scheduling edge cases around cycles, partial failure, and blocked descendants.
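
The interaction between backoff and the dead-letter queue is easier to tune when it is written down as one decision. This is a sketch under assumed names and numbers (RetryPolicy, nextStep, and the policy values are illustrative): the delay doubles after each failure up to a ceiling, and a job that exhausts its attempts is routed to the dead-letter queue instead of being retried.

```typescript
interface RetryPolicy {
  baseDelayMs: number; // delay after the first failure
  maxDelayMs: number; // ceiling so delays do not grow without bound
  maxAttempts: number; // total attempts before the job is dead-lettered
}

type NextStep = { kind: "retry"; delayMs: number } | { kind: "dead-letter" };

// failures is the number of failed attempts so far (1 after the first failure).
function nextStep(failures: number, policy: RetryPolicy): NextStep {
  if (failures >= policy.maxAttempts) return { kind: "dead-letter" };
  // Exponential backoff: base * 2^(failures - 1), capped at maxDelayMs.
  const delayMs = Math.min(policy.maxDelayMs, policy.baseDelayMs * 2 ** (failures - 1));
  return { kind: "retry", delayMs };
}

// Illustrative values only.
const policy: RetryPolicy = { baseDelayMs: 1000, maxDelayMs: 5000, maxAttempts: 5 };
const schedule = [1, 2, 3, 4, 5].map((n) => nextStep(n, policy));
```

With these example values the delays are 1000, 2000, 4000, and 5000 ms (the last one capped), and the fifth failure sends the job to the dead-letter queue.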

The SSE dashboard, built on Server-Sent Events, was a good fit for one-way status updates. SSE is simpler than WebSockets when the browser only needs to receive events, and it works well for dashboards where the server publishes job state changes. The trade-off is connection management. Long-lived HTTP connections interact with proxies, timeouts, buffering, and deploy restarts. Nginx configuration becomes part of the product behavior, not an afterthought.
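
For reference, this is roughly what the server has to emit. The JobEvent shape, the job-status event name, and the helper name are assumptions for the sketch. The id, event, and data fields and the blank-line terminator come from the SSE wire format, and X-Accel-Buffering is the response header Nginx honors to disable proxy buffering for a single response.

```typescript
// Hypothetical event shape for a job state change.
interface JobEvent {
  id: number; // monotonically increasing event id
  jobId: string;
  status: string;
}

// Response headers for an SSE endpoint behind Nginx.
const sseHeaders = {
  "Content-Type": "text/event-stream",
  "Cache-Control": "no-cache",
  "X-Accel-Buffering": "no", // asks Nginx not to buffer this response
};

// Formats one SSE frame. JSON.stringify output contains no raw newlines,
// so a single data: line is enough. The trailing blank line ends the frame.
function toSseFrame(event: JobEvent): string {
  const payload = JSON.stringify({ jobId: event.jobId, status: event.status });
  return `id: ${event.id}\nevent: job-status\ndata: ${payload}\n\n`;
}
```

Sending an id with each frame lets a reconnecting browser report the last event it saw through the Last-Event-ID header, which helps the dashboard recover after a proxy timeout or deploy restart.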

The consistency model also deserves honesty. A small scheduler can often get away with database row locks, conditional updates, and periodic recovery of stale claims. That gives a practical at-least-once model without building a full consensus system. The trade-off is that job handlers must tolerate duplicate execution. If the API presents stronger guarantees than the implementation can actually provide, the system will eventually betray its callers.
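
The two mechanisms named above, conditional updates and stale-claim recovery, can be modeled in memory to show the semantics. This is a simulation under assumed names (Job, tryClaim, recoverStaleClaims, leaseMs), not database code. In a real scheduler the check and the write in tryClaim would be one conditional UPDATE so that they are atomic.

```typescript
interface Job {
  id: string;
  status: "pending" | "claimed" | "done";
  claimedBy: string | null;
  claimedAt: number | null; // epoch milliseconds
}

// Models a conditional update: the write happens only if the job is still pending.
function tryClaim(jobs: Map<string, Job>, id: string, worker: string, now: number): boolean {
  const job = jobs.get(id);
  if (!job || job.status !== "pending") return false;
  job.status = "claimed";
  job.claimedBy = worker;
  job.claimedAt = now;
  return true;
}

// Returns claims older than the lease to pending so another worker can take them.
// This is what makes the model at-least-once: the original worker may still be running.
function recoverStaleClaims(jobs: Map<string, Job>, now: number, leaseMs: number): string[] {
  const recovered: string[] = [];
  jobs.forEach((job) => {
    if (job.status === "claimed" && job.claimedAt !== null && now - job.claimedAt > leaseMs) {
      job.status = "pending";
      job.claimedBy = null;
      job.claimedAt = null;
      recovered.push(job.id);
    }
  });
  return recovered;
}
```

The recovery step is the reason handlers must tolerate duplicates: a slow worker whose lease expired and the worker that picks up the recovered job can both finish the same work.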

The TypeORM bug shows why adapter boundaries matter. ORMs save time by abstracting database access, but raw query APIs often expose driver-specific behavior. The safe pattern is to convert raw results into domain-specific values immediately. For example, a claim operation should return something like { claimed: true, job } or { claimed: false }, not a raw database response whose meaning every caller has to remember.
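
In TypeScript, that return shape maps naturally onto a discriminated union. The sketch below assumes the rows have already been normalized to a plain array. The ClaimedJob fields and the helper names are hypothetical.

```typescript
// Hypothetical job shape returned by a successful claim.
interface ClaimedJob {
  id: string;
  payload: string;
}

// The only two outcomes callers ever see.
type ClaimResult = { claimed: true; job: ClaimedJob } | { claimed: false };

// Converts normalized rows into a domain-level result at the adapter boundary.
function toClaimResult(rows: ClaimedJob[]): ClaimResult {
  const job = rows[0];
  return job ? { claimed: true, job } : { claimed: false };
}

// Callers branch on the discriminant. The compiler only allows access to
// result.job in the branch where the claim succeeded.
function summarize(result: ClaimResult): string {
  return result.claimed ? `claimed ${result.job.id}` : "lost the race";
}
```

No caller needs to know how many elements the driver's envelope had, and a caller that forgets to check claimed cannot read job at all.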

The Zod validation pipeline had a different set of trade-offs. Strict schemas improve correctness, but overly narrow schemas can reject acceptable variation. Retry loops improve output quality, but they add latency and cost. Feeding errors back to the model makes failures self-correcting in many cases, but persistent failures still need caps, fallbacks, and observability.

The deeper connection between the two tasks is that backend systems are mostly boundary management. A database boundary can lie through an unexpected return tuple. A network boundary can lie through firewall order. An AI boundary can lie through valid JSON with invalid meaning. The job is to make those contracts explicit enough that failures become contained events instead of silent corruption.

That is the engineering lesson I took from both projects. The happy path is where the demo lives. The failure path is where the system tells the truth.
