A deep dive into CDC techniques, their scalability impact, consistency guarantees, and API patterns, with practical guidance on choosing log‑based, trigger‑based, or polling approaches and wiring them into event‑driven pipelines.

Change Data Capture: Real‑Time Tracking of Database Mutations

Gen AI apps are built with MongoDB Atlas

Change Data Capture (CDC) is the practice of observing every row‑level mutation—INSERT, UPDATE, DELETE—in a production database and pushing that information to downstream systems. It enables event‑driven architectures, keeps analytical stores in sync, and powers real‑time dashboards without littering business code with instrumentation.

The problem: keeping downstream systems consistent at scale

Traditional batch ETL jobs run every few hours or days. In a high‑traffic service that processes thousands of writes per second, that latency creates stale caches, delayed fraud detection, and duplicated effort when developers build custom sync scripts. The core challenge is two‑fold:

Capture every change reliably while the source database continues to serve traffic.
Propagate those changes to consumers (Kafka topics, search indexes, materialized views) with predictable latency and ordering guarantees.

Both dimensions have scalability implications. A naïve polling solution will hammer the primary, increase lock contention, and explode cost as traffic grows. An approach that reads the transaction log can stay near‑zero impact but must handle log‑format evolution and fault‑tolerant offset management.

Solution approaches

1. Log‑based CDC (the default for most production workloads)

Aspect	Details
Source	Database write‑ahead log (WAL for PostgreSQL, binlog for MySQL, transaction log for SQL Server, oplog for MongoDB).
How it works	A connector tails the log, deserializes each record change event, and emits a structured payload (usually JSON or Avro) to a streaming platform.
Scalability	The log is already written once per transaction; the connector adds a lightweight read‑only stream. Horizontal scaling is achieved by partitioning the log (e.g., per table or shard) and running multiple connector instances.
Consistency model	Exactly‑once delivery is possible when the connector participates in the same transaction group as the downstream sink (e.g., Kafka’s transactional producer). Ordering is preserved per partition.
API pattern	A pull‑based API: the connector registers a consumer group with Kafka, pulls batches, and acknowledges offsets only after successful downstream writes.
Trade‑offs	No schema changes required, minimal impact on the primary. Requires access to the log (often privileged) and careful handling of log format upgrades.

Why it scales – The log is append‑only, so reading it does not interfere with writes. As traffic grows, you can increase the number of connector tasks; each task processes a disjoint slice of the log, keeping CPU and network usage linear.

2. Trigger‑based CDC (when the log is unavailable or you need fine‑grained filtering)

Aspect	Details
Source	Database triggers that fire on INSERT/UPDATE/DELETE and write a change row to a dedicated CDC table or publish to a message broker via `NOTIFY`.
How it works	Each DML statement executes the trigger code, which extracts the `OLD` and `NEW` values and inserts a compact representation into a staging table. A separate process reads that table and forwards events.
Scalability	Triggers execute in the same transaction as the original write, adding CPU and I/O per row. High write rates can cause contention on the CDC table.
Consistency model	Changes are atomic with the original transaction, guaranteeing that no partial updates are emitted. However, ordering across tables is not guaranteed unless you serialize at the application level.
API pattern	Push‑based: the trigger pushes directly to a broker (e.g., `pg_notify`) or writes to a table that a consumer polls.
Trade‑offs	Full control over which columns are captured; can enrich events with business logic. Overhead grows linearly with write volume, making it unsuitable for >10k writes/sec on a single node.

When to choose it – Legacy databases that expose no log, or cases where you must filter out noisy columns (e.g., large JSON blobs) before they ever leave the DB.

3. Polling‑based CDC (the simplest but least efficient)

Aspect	Details
Source	Periodic `SELECT` queries that compare a `last_modified` timestamp or a version column against a stored watermark.
How it works	A scheduler runs every N seconds, fetches rows where `updated_at > watermark`, emits events, and advances the watermark.
Scalability	Database load grows with the frequency of polls and the size of the tables scanned. Full table scans become prohibitive as data volume rises.
Consistency model	At‑least‑once: if a poll fails, the next run may re‑emit the same rows. No ordering guarantees beyond timestamp.
API pattern	Pull‑based batch fetch; often wrapped in a REST endpoint that downstream services call on demand.
Trade‑offs	Zero‑code changes to the source schema, works on any RDBMS. Latency is bounded by the poll interval; high‑frequency polls increase load dramatically.

Typical use case – Low‑traffic reference data that changes a few times per hour, where eventual consistency is acceptable.

Wiring CDC into an event‑driven pipeline

Connector → Streaming platform: Most teams use Apache Kafka as the backbone. Tools like Debezium provide ready‑made connectors for PostgreSQL, MySQL, MongoDB, SQL Server, and Oracle. The connector writes to a topic named dbserver1.inventory.customers with a schema that includes before, after, and op fields.
Schema registry: Register Avro schemas in Confluent Schema Registry so consumers can evolve the payload safely.
Consumer services:
- Cache invalidation – A microservice reads change events and deletes the corresponding key in Redis.
- Search index sync – A worker transforms the after payload into an Elasticsearch document.
- Materialized view – A stream processing job (Kafka Streams, Flink, or ksqlDB) aggregates events to maintain a real‑time leaderboard.
Exactly‑once semantics – Enable Kafka transactions on the producer side and configure the consumer group with isolation.level=read_committed. This guarantees that a downstream database receives each change once, even if the connector restarts.

Trade‑off matrix

Requirement	Log‑based	Trigger‑based	Polling
Latency	< 1 s (depends on broker)	< 1 s (trigger execution)	Poll interval (seconds‑to‑minutes)
Primary impact	Negligible (read‑only)	Additional CPU/I/O per write	Periodic scans, can be heavy
Operational complexity	Requires connector deployment, offset storage	Requires trigger maintenance, CDC table cleanup	Simple scheduler, but needs watermark handling
Schema changes	Transparent	May need trigger rewrite	No impact
Exactly‑once	Supported with Kafka transactions	Harder; need idempotent downstream logic	Not natively supported

When you design a system, start by measuring write throughput and latency SLAs. If you need sub‑second propagation at >10k writes/sec, log‑based CDC is the only viable path. If you are locked into a managed database that hides the log (e.g., some cloud‑only MySQL offerings), a lightweight trigger that writes to a CDC table may be the next best option. Reserve polling for static reference tables that change rarely.

API patterns for CDC consumers

Pull‑based batch API – Consumers request a range of offsets from Kafka (consumer.poll()). This pattern fits batch jobs that need to replay a window of changes, such as back‑filling a data warehouse.
Push‑based webhook – Some teams expose an HTTP endpoint and configure the CDC connector (or a sidecar) to POST each event. Guarantees low latency but requires retry logic and idempotency handling on the receiver.
Change‑feed query – Cloud providers like Google Cloud Spanner expose a READ_CHANGE_STREAM API. Consumers open a streaming RPC and receive a continuous feed, blending the pull and push models.

Choosing the right pattern depends on the consumer’s processing model. Stateless microservices that react to each event benefit from push; analytics pipelines that need to checkpoint offsets prefer pull.

Looking ahead

The CDC ecosystem is converging on three trends:

Unified change‑feed APIs – Databases such as Snowflake, Azure Synapse, and PostgreSQL 15 are exposing native change‑feed endpoints, reducing the need for external connectors.
Serverless CDC – Managed services (e.g., AWS DMS, Azure Data Factory) now allow you to spin up a CDC pipeline without provisioning VMs, but they often hide configuration knobs that affect latency and ordering.
Schema‑aware enrichment – By integrating a schema registry with CDC, you can evolve payloads safely and let downstream services apply version‑specific transformations.

For engineers building real‑time pipelines, the key is to align the CDC method with the consistency guarantees your business requires and the throughput your database can sustain. Start with a log‑based connector, validate end‑to‑end latency, and only fall back to triggers or polling when the platform constraints dictate.

If you need a hands‑on example, the Debezium tutorial on streaming PostgreSQL changes to Kafka walks through containerized setup, schema registration, and a simple consumer that updates an Elasticsearch index.

Change Data Capture: Real‑Time Tracking of Database Mutations

Change Data Capture: Real‑Time Tracking of Database Mutations

The problem: keeping downstream systems consistent at scale

Solution approaches

1. Log‑based CDC (the default for most production workloads)

2. Trigger‑based CDC (when the log is unavailable or you need fine‑grained filtering)

3. Polling‑based CDC (the simplest but least efficient)

Wiring CDC into an event‑driven pipeline

Trade‑off matrix

API patterns for CDC consumers

Looking ahead

Comments