From Batch to Micro-Batch Streaming: Optimizing Delta Index Pipelines for Freshness
#Infrastructure

Cloud Reporter
7 min read

A detailed analysis of transitioning from scheduled batch jobs to micro-batch streaming in a production delta index pipeline, revealing that scheduling delay, not processing cost, is often the dominant source of latency.

Many organizations operate data pipelines that function as near-continuous systems, processing incremental data frequently with overlapping windows, yet still describe them as batch systems. This article examines the migration of such a system—a set of scheduled batch jobs responsible for generating a delta index used in a search and ads retrieval pipeline—from discrete scheduled execution to continuously running micro-batch streaming using Spark Structured Streaming.

The Core Problem: Scheduling Delay vs. Processing Cost

The primary bottleneck in our delta index pipeline was not computational efficiency, but scheduling delay and orchestration overhead. Our system processed incremental ads, ad groups, and campaign updates to generate an inverted index used by online retrieval services. At scale, this involved millions of documents with full index sizes in the hundreds of gigabytes and delta sizes typically in the tens of gigabytes range.

In the original batch model:

  • New incremental data arrived every five to seven minutes
  • Each delta run covered approximately the last five hours
  • Multiple delta runs per hour were expected and necessary

Despite this frequent execution, the externally scheduled batch jobs created significant freshness gaps. Data arriving shortly after a scheduled run often waited nearly a full scheduling interval before being processed. Failures required re-executing the entire scheduled window, and during bursts of updates, longer batch durations shrank or eliminated the idle gaps between runs.

The key insight emerged: batch scheduling delay and orchestration overhead, rather than processing cost, were the primary contributors to freshness lag.

Why Traditional Streaming Approaches Failed

When streaming was first proposed, the pushback centered on operational concerns:

  • Long-running streaming jobs are harder to reason about operationally
  • Recovery behavior can be unpredictable
  • Failures tend to linger instead of terminating cleanly
  • On-call burden often increases rather than decreases

Our initial attempt at record-level streaming quickly revealed that it solved a problem we didn't have while introducing complexity we didn't need. The indexing logic assumed batch completeness and operated at a product or item grouping level rather than at the individual ad record level. Moving to per-record streaming would have introduced partial-update states in which some ads were updated but the grouped index representation was not yet fully consistent.

The business didn't need per-record immediacy. What it needed was to stop waiting for batch schedules. This led to our first major realization: record-level streaming was introducing operational risks without delivering meaningful benefits for our use case.

Converging on Micro-Batch Streaming

We deliberately chose micro-batch streaming, not for continuous record processing, but for continuous availability. This approach allowed us to remove scheduling gaps while preserving batch-oriented semantics where they mattered.

The job ran in Spark Structured Streaming's micro-batch mode with a fixed processing-time trigger of approximately thirty seconds. This interval was intentionally much smaller than the partition arrival cadence of five to seven minutes, ensuring newly available data was picked up quickly without external scheduling gaps.
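
The article doesn't include code, so the following is a minimal PySpark sketch of how such a fixed-interval driver might be wired up. The rate source, the on_trigger callback, and the checkpoint path are illustrative assumptions: the synthetic stream serves purely as a heartbeat that fires the custom partition loop every thirty seconds.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-index-microbatch").getOrCreate()

def on_trigger(batch_df, batch_id):
    # Rows from the rate source are ignored; each invocation is simply a
    # thirty-second heartbeat that runs one bounded partition-selection cycle.
    run_one_cycle(spark)  # hypothetical helper, sketched after the list below

query = (
    spark.readStream
    .format("rate")                        # synthetic source used purely as a clock
    .option("rowsPerSecond", 1)
    .load()
    .writeStream
    .trigger(processingTime="30 seconds")  # fixed processing-time trigger
    .foreachBatch(on_trigger)
    # This checkpoint only tracks the synthetic source; actual progress lives
    # in an externally persisted watermark, described later in the article.
    .option("checkpointLocation", "/tmp/delta-index-heartbeat-ckpt")
    .start()
)
query.awaitTermination()
```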

Each trigger execution followed a bounded, deterministic sequence, sketched in code after this list:

  1. Determine the current wall-clock time
  2. Compute the latest partition that should be considered eligible based on time and partitioning rules
  3. Compare that partition with the current watermark (the last acknowledged partition)
  4. If multiple partitions were pending, advance directly to the latest eligible partition rather than processing intermediate ones
  5. Process one bounded partition per trigger cycle at most
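
A minimal Python sketch of this per-trigger sequence, assuming a six-minute partition cadence (the article cites five to seven minutes) and hypothetical helpers: process_partition stands in for the indexing step, while load_watermark and save_watermark are sketched later in this article.

```python
from datetime import datetime, timezone

PARTITION_MINUTES = 6  # assumed cadence; new partitions arrive every five to seven minutes

def latest_eligible_partition(now: datetime) -> datetime:
    """Step 2: round wall-clock time down to the most recent partition boundary."""
    minute = (now.minute // PARTITION_MINUTES) * PARTITION_MINUTES
    return now.replace(minute=minute, second=0, microsecond=0)

def run_one_cycle(spark):
    # Step 1: current wall-clock time (naive UTC, matching partition timestamps).
    now = datetime.now(timezone.utc).replace(tzinfo=None)
    candidate = latest_eligible_partition(now)  # step 2: latest eligible partition
    watermark = load_watermark()                # step 3: last acknowledged partition
    if watermark is not None and candidate <= watermark:
        return  # nothing new; exit this cycle immediately
    # Steps 4-5: jump directly to the latest eligible partition, skipping any
    # intermediate ones, and process at most one bounded partition per trigger.
    process_partition(spark, candidate)
    save_watermark(candidate)
```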

This approach prioritized freshness by advancing directly to the latest available partition, accepting that intermediate states would be implicitly covered by overlapping window computations. The sliding window duration was significantly larger than the partition interval, so any skipped partition was always covered by subsequent recomputation; with a roughly five-hour window and partitions arriving every five to seven minutes, each partition fell inside dozens of later window recomputations.

Object Storage Challenges and Solutions

Both the source and sink of our pipeline used object storage with time-partitioned paths like /year/month/day/hour/minute. This created unique challenges for progress tracking.

Our first attempt reused the existing success-file and completion-marker logic from the batch pipeline. This approach quickly broke down in a continuously running environment:

  • Completion markers appeared late or inconsistently due to non-atomic visibility between data files and completion markers
  • Partition listings were temporarily incomplete, even when data had already been written
  • The streaming job repeatedly polled for markers that already existed but were not yet visible
  • Duplicate or premature processing occurred when listings and markers became visible at slightly different times

We replaced this with a rate-based trigger that executed every thirty seconds. The system no longer waited for specific completion signals but advanced deterministically based on time and watermark comparison, as in the sketch after this list:

  1. List the currently visible partitions in object storage
  2. Identify the latest visible partition based on timestamp ordering
  3. Compare it to the watermark (the last partition the pipeline has acknowledged)
  4. Process the latest partition and advance the watermark if newer
  5. Otherwise, exit immediately
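
A sketch of this listing-based advance step, assuming an S3-style API via boto3 and the /year/month/day/hour/minute layout described above; the bucket, prefix, and helper names are illustrative rather than taken from the article.

```python
import boto3
from datetime import datetime

BUCKET = "delta-index-input"  # hypothetical bucket
PREFIX = "deltas"             # partitions live under deltas/YYYY/MM/DD/HH/MM/

def list_visible_partitions(s3) -> set[datetime]:
    """Step 1: collect the partition timestamps of every currently visible object."""
    seen = set()
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX + "/")
    for page in pages:
        for obj in page.get("Contents", []):
            # Keys look like deltas/2024/05/17/09/42/part-00000.parquet
            year, month, day, hour, minute = obj["Key"].split("/")[1:6]
            seen.add(datetime(int(year), int(month), int(day), int(hour), int(minute)))
    return seen

def advance(spark):
    s3 = boto3.client("s3")
    visible = list_visible_partitions(s3)
    if not visible:
        return
    latest = max(visible)         # step 2: latest visible partition by timestamp
    watermark = load_watermark()  # step 3: last acknowledged partition
    if watermark is None or latest > watermark:
        process_partition(spark, latest)  # step 4: process and advance the watermark
        save_watermark(latest)
    # Step 5: otherwise fall through and exit until the next trigger fires.
```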

This approach eliminated dependency on fragile completion markers while maintaining predictable progress and quick data pickup.

Handling Lag and Restart Behavior

A critical design decision was how to handle lag—what should happen if multiple new partitions became visible between trigger cycles?

Instead of processing each unprocessed partition in order (the traditional streaming approach), we implemented a freshness-first rule: always advance directly to the most recent visible partition. This behavior was safe because:

  • Each delta run recomputed a recent time window rather than applying small incremental changes
  • These windows overlapped, so most skipped partitions were naturally covered in later runs
  • The system didn't require processing every intermediate state, only reaching the latest correct snapshot
  • Any missed updates were picked up by the next full index rebuild, which ran every few hours as a bounded recovery mechanism

The same rule applied during restarts. On restart, the job determined the latest visible partition, compared it with the persisted watermark, and processed that partition only if it was newer. This avoided reliance on the streaming framework's checkpointing, which is designed for sequential replay and didn't align with our freshness-first model.
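
The article doesn't say where the watermark was persisted; one minimal possibility, and what load_watermark and save_watermark in the sketches above assume, is a single small object in the same bucket:

```python
import boto3
from datetime import datetime

# BUCKET is the hypothetical bucket from the earlier sketch.
WATERMARK_KEY = "state/delta-index-watermark"  # hypothetical key
WATERMARK_FORMAT = "%Y-%m-%dT%H:%M"

def load_watermark() -> datetime | None:
    """Read the last acknowledged partition timestamp, or None on first run."""
    s3 = boto3.client("s3")
    try:
        body = s3.get_object(Bucket=BUCKET, Key=WATERMARK_KEY)["Body"].read()
        return datetime.strptime(body.decode("utf-8"), WATERMARK_FORMAT)
    except s3.exceptions.NoSuchKey:
        return None

def save_watermark(ts: datetime) -> None:
    """Persist the newly acknowledged partition timestamp."""
    s3 = boto3.client("s3")
    s3.put_object(Bucket=BUCKET, Key=WATERMARK_KEY,
                  Body=ts.strftime(WATERMARK_FORMAT).encode("utf-8"))
```

On restart, the driver simply runs the same advance cycle against this persisted value, yielding the process-only-if-newer behavior described above without touching Spark's checkpoint state.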

Operational Considerations: Memory Pressure and Planned Restarts

Running a long-lived, JVM-based streaming job continuously exposed problems that scheduled batch jobs had never encountered. Heap usage gradually increased, garbage collection pauses became more frequent, and micro-batch completion times grew less predictable over extended runtimes.

Rather than fighting this behavior, we embraced planned restarts as an operational tool. The job was designed to restart automatically every twenty-four hours when healthy, accomplishing several objectives:

  • Releasing accumulated memory
  • Resetting internal execution state
  • Allowing new code to be picked up without manual intervention

To make the system predictable, we introduced a lightweight watchdog implemented externally to the streaming runtime. The watchdog monitored job liveness, restarted jobs on unexpected termination, enforced periodic restarts, and standardized behavior across environments. Failures became routine and recoverable instead of urgent and disruptive.
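
The watchdog is described but not shown; a minimal external sketch, assuming the driver is launched via spark-submit and liveness is approximated by process state, could look like this:

```python
import subprocess
import time

MAX_UPTIME_SECONDS = 24 * 60 * 60  # planned restart every twenty-four hours
POLL_SECONDS = 30                  # liveness check interval

def run_forever():
    while True:
        started = time.time()
        # Hypothetical entry point for the streaming driver.
        job = subprocess.Popen(["spark-submit", "delta_index_stream.py"])
        while job.poll() is None:
            if time.time() - started > MAX_UPTIME_SECONDS:
                job.terminate()  # healthy, but due for a planned restart
                try:
                    job.wait(timeout=120)
                except subprocess.TimeoutExpired:
                    job.kill()   # escalate if graceful shutdown stalls
                break
            time.sleep(POLL_SECONDS)
        # Fall through here after either a planned restart or an unexpected
        # exit; back off briefly, then relaunch.
        time.sleep(10)

if __name__ == "__main__":
    run_forever()
```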

Business Impact and Results

After the migration to micro-batch streaming, end-to-end latency dropped by approximately fifty percent. The worst-case freshness delay fell from roughly ten minutes to about thirty seconds, primarily due to the elimination of scheduler and orchestration delays rather than any change in per-batch processing cost.

In production, where delta freshness directly affected ad visibility and advertiser expectations, we observed:

  • Faster propagation of incremental updates
  • Predictable recovery behavior
  • Fewer memory-related incidents
  • Easier rollout of code changes

These results were achieved without committing to record-level streaming, demonstrating that the best streaming design is the one that works reliably in production, not the one that looks the most elegant in theory.

When This Approach Applies and When It Doesn't

This micro-batch streaming pattern works well when:

  • Data naturally fits into bounded snapshots
  • The goal is to stay close to the latest state rather than process every intermediate step
  • Ingestion relies on object storage where completion signals are inferred
  • Freshness matters more than strict replay of intermediate states

It is not a good fit for systems that:

  • Require strict per-record processing guarantees
  • Need ordered replay of all historical events
  • Demand strong guarantees on processing every update

The key lesson is that streaming success comes from understanding the specific business problem being solved rather than applying the most technically advanced solution available.

About the Author

Parveen Saini is a Staff Software Engineer II with deep experience building and operating large-scale distributed systems, including search and indexing platforms, backend services, and data-intensive pipelines. His work focuses on pragmatic system design, performance, and reliability in production, with an emphasis on making complex systems simpler and more predictable to operate.
