A four‑day outage in long‑interval maintenance jobs was traced to a missing next_run_time in the scheduler, leading to a redesign that anchors interval triggers to persisted timestamps. At the same time we eliminated ad‑hoc globals by introducing a process‑wide AppContainer, and added a robust fallback for a reasoning model that returned malformed JSON. The changes improve scalability, tighten consistency guarantees, and simplify the API surface for future extensions.
Silent Scheduler Failures and a Global‑State Cleanup: What We Shipped on 2026‑05‑30
The problem – intermittent long‑interval jobs vanished
Our automation platform runs two families of jobs:
- Sub‑hour tasks (e.g., cache refreshes) that fire every few minutes.
- Daily maintenance jobs (e.g., analyze_topic_gaps, detect_duplicate_posts) that should fire once every 24 hours.
During a routine health check we discovered that the daily jobs had not executed for four days while the sub‑hour jobs kept running. The symptom was subtle: the scheduler log showed the jobs being registered on each worker start, but never reaching their next execution time.
Why it mattered for scalability and consistency
- Scalability – In a horizontally scaled worker pool, each restart re‑anchored the interval to the boot timestamp. Because developers redeployed frequently during active feature work, the next fire time kept sliding forward and a 24‑hour job never came due. The failure was silent and got worse with every additional restart.
- Consistency model – The missing next_run_time meant the scheduler’s view of “when the job should run next” diverged from the persisted state of the system. Other services that relied on the job’s side‑effects (e.g., topic‑gap metrics) observed stale data, breaking eventual consistency guarantees.
- API pattern impact – The register_job API silently accepted an incomplete IntervalTrigger. Consumers had no way to detect that the trigger was malformed, violating the principle of fail‑fast APIs.
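To make the sliding interval concrete, here is a small simulation of the failure mode. It is only a sketch: the six‑hour deploy cadence, the dates, and the function name are assumptions chosen for illustration, not values taken from our logs.

```python
from datetime import datetime, timedelta


def firings_with_boot_anchor(boot_times, interval, end):
    """Count firings when every restart re-anchors the interval to boot time."""
    fired = 0
    for i, boot in enumerate(boot_times):
        # The process lives until the next restart, or until the window ends.
        alive_until = boot_times[i + 1] if i + 1 < len(boot_times) else end
        next_run = boot + interval
        while next_run < alive_until:
            fired += 1
            next_run += interval
    return fired


start = datetime(2026, 5, 26)
end = start + timedelta(days=4)
# Hypothetical cadence: one restart every 6 hours across the four days.
boots = [start + timedelta(hours=6 * n) for n in range(16)]

daily_fired = firings_with_boot_anchor(boots, timedelta(hours=24), end)
five_min_fired = firings_with_boot_anchor(boots, timedelta(minutes=5), end)
print(daily_fired, five_min_fired)
```

Under these assumptions the 24‑hour job never comes due, because no process lives long enough to reach boot time plus 24 hours, while the five‑minute job keeps firing normally. That matches the symptom we saw: sub‑hour jobs healthy, daily jobs absent.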
Solution approach – Persisted anchors and a single DI container
1️⃣ Anchor interval jobs to persisted timestamps
We introduced a plugin‑specific key plugin_job_last_run_<name> stored in our MongoDB collection. The scheduler now:
- Reads the last successful run timestamp (or falls back to the current time on first start).
- Calculates the next fire time based on the configured interval relative to that persisted epoch.
- Stores the new next_run_time back to the same key after each successful execution.
This change decouples job cadence from process start time. Even if a worker restarts dozens of times, the interval remains anchored to the last real execution, so a 24‑hour job comes due 24 hours after its previous run regardless of deployment frequency.
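The calculation can be sketched roughly as follows. This is a minimal illustration rather than our production scheduler code: the function name is hypothetical, a plain dict stands in for the MongoDB collection, and the choice to run an overdue job immediately is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def next_fire_time(last_run: Optional[datetime],
                   interval: timedelta,
                   now: datetime) -> datetime:
    """Anchor the interval to the persisted last run, not to process start."""
    if last_run is None:
        # First start: no persisted anchor yet, so fall back to the current time.
        return now + interval
    due = last_run + interval
    # Assumption: a job that is already overdue should run right away.
    return max(due, now)


# A dict stands in for the persisted store in this sketch.
store = {}
key = "plugin_job_last_run_analyze_topic_gaps"
now = datetime(2026, 5, 30, 12, 0, tzinfo=timezone.utc)

first = next_fire_time(store.get(key), timedelta(hours=24), now)

# Pretend the job last succeeded 20 hours ago, then the worker restarted.
store[key] = now - timedelta(hours=20)
after_restart = next_fire_time(store.get(key), timedelta(hours=24), now)
```

In this sketch a restart 20 hours after the last run yields a fire time 4 hours out instead of a fresh 24‑hour countdown, which is the property the redesign is after.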
2️⃣ Replace ad‑hoc globals with a process‑wide AppContainer
The codebase historically relied on per‑module globals and a mutable WIRED_MODULES list. Those globals made testing brittle and introduced hidden coupling:
- Modules could be imported in any order, leading to nondeterministic state.
- DI (dependency injection) was performed manually, scattering factories across the repo.
We now expose a single entry point – services/bootstrap.py::build_container – that constructs an AppContainer object holding all service instances. Every production caller, CLI command, and Prefect workflow obtains dependencies through this container. The benefits are concrete:
- Predictable lifecycle – objects are created once per process, then reused, matching the typical singleton semantics without global mutation.
- Testability – tests can instantiate a fresh container with mock implementations, eliminating side‑effects from previous test runs.
- Scalability – as we add new micro‑services, they simply register themselves with the container; no new globals are introduced.
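The shape of the container can be illustrated with a short sketch. Only the names AppContainer and build_container come from our codebase; the fields, the InMemorySettings and JobScheduler classes, and the override parameters are hypothetical stand‑ins to show the pattern.

```python
from dataclasses import dataclass
from typing import Any, Optional


class InMemorySettings:
    """Hypothetical settings store used as a default dependency."""

    def __init__(self) -> None:
        self._data = {}

    def get(self, key: str, default: Any = None) -> Any:
        return self._data.get(key, default)

    def set(self, key: str, value: Any) -> None:
        self._data[key] = value


class JobScheduler:
    """Hypothetical service that receives its dependency explicitly."""

    def __init__(self, settings: Any) -> None:
        self.settings = settings


@dataclass(frozen=True)
class AppContainer:
    settings: Any
    scheduler: Any


def build_container(settings: Optional[Any] = None,
                    scheduler: Optional[Any] = None) -> AppContainer:
    """Build every service once; callers may pass fakes for tests."""
    settings = settings if settings is not None else InMemorySettings()
    scheduler = scheduler if scheduler is not None else JobScheduler(settings)
    return AppContainer(settings=settings, scheduler=scheduler)


# Production-style use: build once per process and pass the container around.
container = build_container()

# Test-style use: a fresh container with a fake, no shared module state.
fake_settings = InMemorySettings()
fake_settings.set("plugin_job_last_run_demo", "2026-05-29T00:00:00Z")
test_container = build_container(settings=fake_settings)
```

Because each call to build_container returns independent instances, one test cannot leak state into the next, which is the isolation property the globals made hard to get.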
3️⃣ Guard against malformed model responses
Our content‑generation pipeline calls the reasoning model glm‑4.7‑5090. Occasionally the model emitted an empty JSON object and placed the actual tokens in a reasoning_content channel instead of the expected content field. The downstream run_sweep step attempted to deserialize the empty object, raising a JSONDecodeError and aborting the entire sweep.
We added a fallback parser that:
- Detects the presence of a non‑empty reasoning_content field.
- Strips the surrounding “thinking wrapper” and promotes the payload to the primary content field.
- Logs a warning and continues processing.
This defensive pattern follows the “graceful degradation” API style: the service remains available even when upstream models misbehave, preserving overall system throughput.
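A rough sketch of such a fallback is shown below. It rests on assumptions we label explicitly: the function name is hypothetical, the response is treated as a plain dict, the wrapper is assumed to be a pair of `<think>` tags, and the payload is assumed to be the outermost JSON object in the reasoning text. The real wrapper format may differ.

```python
import json
import logging
import re

logger = logging.getLogger(__name__)

# Assumption: the "thinking wrapper" is a pair of <think>...</think> tags.
_WRAPPER_TAGS = re.compile(r"</?think>")


def promote_reasoning(message: dict) -> dict:
    """Return a message whose content holds the payload, if one can be found."""
    content = (message.get("content") or "").strip()
    if content not in ("", "{}"):
        return message  # normal case: content is already populated

    reasoning = (message.get("reasoning_content") or "").strip()
    if not reasoning:
        return message  # nothing to promote; let the caller handle it

    unwrapped = _WRAPPER_TAGS.sub("", reasoning)
    start, end = unwrapped.find("{"), unwrapped.rfind("}")
    if start == -1 or end < start:
        return message  # no JSON object in the reasoning text

    logger.warning("empty content; promoting payload from reasoning_content")
    return {**message, "content": unwrapped[start:end + 1]}


malformed = {
    "content": "{}",
    "reasoning_content": '<think>{"title": "Topic gaps"}</think>',
}
payload = json.loads(promote_reasoning(malformed)["content"])
```

The function returns a new dict instead of mutating the response, so the original message stays available for logging when the fallback is triggered.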
Trade‑offs and what to watch next
| Aspect | Benefit | Cost / Consideration |
|---|---|---|
| Persisted job anchors | Keeps daily jobs on their 24‑hour cadence, independent of restarts. | Requires an extra write per job execution; the added latency is negligible for our workload. |
| AppContainer DI | Centralizes object creation, eliminates hidden globals, improves test isolation. | Initial refactor effort was high; developers must now acquire dependencies via the container rather than direct imports. |
| Model fallback | Prevents a single malformed response from halting the entire sweep. | May mask systematic issues with the model; we should add metrics to track fallback frequency and alert if it rises above a threshold. |
Broader implications for our stack
- Scalability – By persisting scheduling state, we can safely increase the number of worker nodes without fearing duplicate or missed executions. The scheduler becomes idempotent with respect to restarts, a key property for autoscaling clusters.
- Consistency – The persisted next_run_time aligns the scheduler’s view with the database’s view, tightening our eventual consistency guarantees across services that depend on maintenance jobs.
- API patterns – The revised register_job now validates its trigger configuration and returns an explicit error if required fields are missing. This mirrors the “fail fast” approach we use for our public REST endpoints, encouraging callers to handle errors early.
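The fail‑fast check might look something like this. The sketch does not reproduce the real register_job signature: the IntervalTrigger dataclass, the InvalidTriggerError type, the dict registry, and the decision to raise an exception are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


class InvalidTriggerError(ValueError):
    """Raised when a trigger is missing a field the scheduler needs."""


@dataclass
class IntervalTrigger:
    # Stand-in for the scheduler's trigger type, not the real class.
    interval: timedelta
    next_run_time: Optional[datetime] = None


def register_job(registry: Dict[str, IntervalTrigger],
                 name: str,
                 trigger: IntervalTrigger) -> None:
    """Reject incomplete triggers at registration time instead of at run time."""
    if trigger.interval <= timedelta(0):
        raise InvalidTriggerError(f"{name}: interval must be positive")
    if trigger.next_run_time is None:
        raise InvalidTriggerError(f"{name}: next_run_time is required")
    registry[name] = trigger


registry: Dict[str, IntervalTrigger] = {}
try:
    register_job(registry, "detect_duplicate_posts",
                 IntervalTrigger(interval=timedelta(hours=24)))
    rejected = False
except InvalidTriggerError:
    rejected = True
```

With a check like this in place, the incomplete trigger that caused the outage would have failed loudly at worker start instead of being registered and never firing.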
Looking ahead
The next iteration will focus on distributed lock coordination for jobs that must not overlap across multiple workers. We plan to adopt MongoDB’s $setOnInsert pattern combined with a short‑lived lease document, ensuring that even in a multi‑region deployment only one instance runs the critical section.
For developers interested in the underlying changes, the full diff is available in the merged pull requests:

The work was auto‑compiled by Poindexter, our internal code‑generation pipeline that stitches together the live atom catalog. You can explore the repository at the official GitHub page.
If you run long‑interval jobs on a platform that frequently restarts, consider persisting the last run timestamp. It’s a tiny change that prevents a silent outage from turning into a multi‑day data quality incident.
