Silent Scheduler Failures and a Global‑State Cleanup: What We Shipped on 2026‑05‑30

Backend Reporter

A four‑day outage in long‑interval maintenance jobs was traced to a missing next_run_time in the scheduler, leading to a redesign that anchors interval triggers to persisted timestamps. At the same time we eliminated ad‑hoc globals by introducing a process‑wide AppContainer, and added a robust fallback for a reasoning model that returned malformed JSON. The changes improve scalability, tighten consistency guarantees, and simplify the API surface for future extensions.

The problem – intermittent long‑interval jobs vanished

Our automation platform runs two families of jobs:

  • Sub‑hour tasks (e.g., cache refreshes) that fire every few minutes.
  • Daily maintenance jobs (e.g., analyze_topic_gaps, detect_duplicate_posts) that should fire once every 24 hours.

During a routine health check we discovered that the daily jobs had not executed for four days while the sub‑hour jobs kept running. The symptom was subtle: the scheduler log showed the jobs being registered on each worker start, but never reaching their next execution time.

Why it mattered for scalability and consistency

  • Scalability – In a horizontally scaled worker pool, each restart re‑anchored the interval to the boot timestamp. Because developers redeployed frequently during active feature work, the next fire time kept sliding forward and a 24‑hour job never came due. The failure was silent and grew worse with every additional restart.
  • Consistency model – The missing next_run_time meant the scheduler’s view of “when the job should run next” diverged from the persisted state of the system. Other services that relied on the job’s side‑effects (e.g., topic‑gap metrics) observed stale data, breaking eventual consistency guarantees.
  • API pattern impact – The register_job API silently accepted an incomplete IntervalTrigger. Consumers had no way to detect that the trigger was malformed, violating the principle of fail‑fast APIs.

Solution approach – Persisted anchors and a single DI container

1️⃣ Anchor interval jobs to persisted timestamps

We introduced a plugin‑specific key plugin_job_last_run_<name> stored in our MongoDB collection. The scheduler now:

  1. Reads the last successful run timestamp (or falls back to the current time on first start).
  2. Calculates the next fire time based on the configured interval relative to that persisted epoch.
  3. Stores the completion timestamp back to the same key after each successful execution, so the following next_run_time is derived from it.

This change decouples job cadence from process start time. Even if a worker restarts dozens of times, the interval remains anchored to the last real execution, so a 24‑hour job comes due roughly 24 hours after its previous run regardless of deployment frequency.
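The anchoring rule can be sketched in a few lines. This is a minimal illustration, not the code we shipped: the function name compute_next_run_time is hypothetical, and firing an overdue job immediately (rather than waiting another full interval) is an assumption about the desired behavior.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def compute_next_run_time(
    last_run: Optional[datetime],
    interval: timedelta,
    now: datetime,
) -> datetime:
    """Return the next fire time anchored to the persisted last run.

    `last_run` is the value read from the plugin_job_last_run_<name> key,
    or None when the key does not exist yet.
    """
    if last_run is None:
        # First start: no persisted anchor, so fall back to the current time.
        return now + interval
    due = last_run + interval
    # Assumption: an overdue job fires right away instead of sliding forward.
    return max(due, now)


# A worker that restarts 20 hours after the last run still fires 4 hours later.
now = datetime(2026, 5, 30, 12, 0, tzinfo=timezone.utc)
next_fire = compute_next_run_time(now - timedelta(hours=20), timedelta(hours=24), now)
```

Because the result depends only on the persisted timestamp and the interval, calling the function again after a restart yields the same fire time. If the scheduler is APScheduler (an assumption; this post does not name the library), the value could be passed as the next_run_time argument of add_job.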

2️⃣ Replace ad‑hoc globals with a process‑wide AppContainer

The codebase historically relied on per‑module globals and a mutable WIRED_MODULES list. Those globals made testing brittle and introduced hidden coupling:

  • Modules could be imported in any order, leading to nondeterministic state.
  • DI (dependency injection) was performed manually, scattering factories across the repo.

We now expose a single entry point, services/bootstrap.py::build_container, that constructs an AppContainer object holding all service instances. Every production caller, CLI command, and Prefect workflow obtains dependencies through this container. The benefits are concrete:

  • Predictable lifecycle – objects are created once per process, then reused, matching the typical singleton semantics without global mutation.
  • Testability – tests can instantiate a fresh container with mock implementations, eliminating side‑effects from previous test runs.
  • Scalability – as we add new micro‑services, they simply register themselves with the container; no new globals are introduced.
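To make the container idea concrete, here is a minimal sketch. Only the names AppContainer and build_container come from our codebase; the SettingsStore protocol, the InMemorySettingsStore class, and the settings parameter are hypothetical stand-ins chosen for illustration, and the real container holds more services.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


class SettingsStore(Protocol):
    """Hypothetical interface for the key/value store behind the scheduler."""

    def get(self, key: str) -> Optional[str]: ...

    def set(self, key: str, value: str) -> None: ...


class InMemorySettingsStore:
    """Test double; a production implementation would wrap the MongoDB collection."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


@dataclass(frozen=True)
class AppContainer:
    """Holds the service instances shared within one process."""

    settings: SettingsStore


def build_container(settings: Optional[SettingsStore] = None) -> AppContainer:
    # Callers may inject a replacement; otherwise a default instance is built.
    if settings is None:
        settings = InMemorySettingsStore()
    return AppContainer(settings=settings)
```

A test can call build_container() to get a fresh, isolated container, or pass its own mock as settings, so no state leaks between test runs through module-level variables.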

3️⃣ Guard against malformed model responses

Our content‑generation pipeline calls the reasoning model glm‑4.7‑5090. Occasionally the model left the expected content field empty and placed the actual tokens in a reasoning_content channel instead. The downstream run_sweep step attempted to deserialize the empty content, raising a JSONDecodeError and aborting the entire sweep.

We added a fallback parser that:

  • Detects the presence of a non‑empty reasoning_content field.
  • Strips the surrounding “thinking wrapper” and promotes the payload to the primary content field.
  • Logs a warning and continues processing.

This defensive pattern follows the “graceful degradation” API style: the service remains available even when upstream models misbehave, preserving overall system throughput.
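The fallback can be sketched as follows. This is an illustration under stated assumptions, not the shipped parser: the function name extract_payload is hypothetical, the response is assumed to be a dict with content and reasoning_content keys, and the "thinking wrapper" is assumed to be a pair of <think> tags around the payload.

```python
import json
import logging
import re
from typing import Any, Dict

logger = logging.getLogger(__name__)

# Assumed wrapper format: <think> ... </think> tags around the payload.
_WRAPPER_TAGS = re.compile(r"</?think>")


def extract_payload(message: Dict[str, Any]) -> Any:
    """Parse the model's JSON payload, falling back to reasoning_content."""
    content = (message.get("content") or "").strip()
    reasoning = (message.get("reasoning_content") or "").strip()

    # Treat a blank string or an empty JSON object as "no usable content".
    if content in ("", "{}") and reasoning:
        logger.warning("content was empty; promoting reasoning_content")
        content = _WRAPPER_TAGS.sub("", reasoning).strip()

    # Still raises json.JSONDecodeError if nothing parseable was found.
    return json.loads(content)
```

The normal path is unchanged: a response with a populated content field is parsed directly and never touches the fallback. A response with neither field populated still fails loudly, which keeps genuinely broken output visible.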

Trade‑offs and what to watch next

Aspect | Benefit | Cost / Consideration
Persisted job anchors | Keeps daily jobs on a 24‑hour cadence, independent of restarts. | Requires an extra write per job execution; the added latency is negligible for our workload.
AppContainer DI | Centralizes object creation, eliminates hidden globals, improves test isolation. | Initial refactor effort was high; developers must now acquire dependencies via the container rather than direct imports.
Model fallback | Prevents a single malformed response from halting the entire sweep. | May mask systematic issues with the model; we should add metrics to track fallback frequency and alert if it rises above a threshold.

Broader implications for our stack

  • Scalability – By persisting scheduling state, we can add worker nodes and redeploy freely without restarts causing missed executions. The scheduler's timing becomes idempotent with respect to restarts, a key property for autoscaling clusters. Preventing duplicate executions across workers still requires lock coordination, which is covered under "Looking ahead".
  • Consistency – Deriving next_run_time from persisted state aligns the scheduler’s view with the database’s view, tightening our eventual consistency guarantees across services that depend on maintenance jobs.
  • API patterns – The revised register_job now validates its trigger configuration and returns an explicit error if required fields are missing. This mirrors the “fail fast” approach we use for our public REST endpoints, encouraging callers to handle errors early.
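A fail‑fast registration check might look like the sketch below. Only the name register_job comes from our codebase; the JobRegistry class, the IntervalTriggerConfig dataclass, the InvalidTriggerError exception, and the exact set of required fields are hypothetical, and raising an exception is one possible way to surface the explicit error.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable, Dict, List, Optional, Tuple


class InvalidTriggerError(ValueError):
    """Raised when a trigger is missing a field the scheduler needs."""


@dataclass(frozen=True)
class IntervalTriggerConfig:
    interval: Optional[timedelta] = None
    next_run_time: Optional[datetime] = None


class JobRegistry:
    def __init__(self) -> None:
        self.jobs: Dict[str, Tuple[Callable[[], None], IntervalTriggerConfig]] = {}

    def register_job(
        self, name: str, func: Callable[[], None], trigger: IntervalTriggerConfig
    ) -> None:
        # Collect every problem so the caller sees them all in one error.
        problems: List[str] = []
        if trigger.interval is None or trigger.interval <= timedelta(0):
            problems.append("interval must be a positive timedelta")
        if trigger.next_run_time is None:
            problems.append("next_run_time is required")
        elif trigger.next_run_time.tzinfo is None:
            problems.append("next_run_time must be timezone-aware")
        if problems:
            raise InvalidTriggerError(f"job {name!r}: " + "; ".join(problems))
        self.jobs[name] = (func, trigger)


# Usage: a complete trigger registers; an incomplete one raises immediately.
registry = JobRegistry()
registry.register_job(
    "detect_duplicate_posts",
    lambda: None,
    IntervalTriggerConfig(
        interval=timedelta(hours=24),
        next_run_time=datetime(2026, 5, 31, tzinfo=timezone.utc),
    ),
)
```

Rejecting the trigger at registration time moves the failure from "four days later, in a health check" to "at worker start, in the deploy log".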

Looking ahead

The next iteration will focus on distributed lock coordination for jobs that must not overlap across multiple workers. We plan to adopt MongoDB’s $setOnInsert pattern combined with a short‑lived lease document, ensuring that even in a multi‑region deployment only one instance runs the critical section.

For developers interested in the underlying changes, the full diff is available in the merged pull requests:

The work was auto‑compiled by Poindexter, our internal code‑generation pipeline that stitches together the live atom catalog. You can explore the repository at the official GitHub page.


If you run long‑interval jobs on a platform that frequently restarts, consider persisting the last run timestamp. It’s a tiny change that prevents a silent outage from turning into a multi‑day data quality incident.
