A deep dive into CDC techniques, their scalability impact, consistency guarantees, and API patterns, with practical guidance on choosing log‑based, trigger‑based, or polling approaches and wiring them into event‑driven pipelines.
Change Data Capture: Real‑Time Tracking of Database Mutations
![]()
Change Data Capture (CDC) is the practice of observing every row‑level mutation—INSERT, UPDATE, DELETE—in a production database and pushing that information to downstream systems. It enables event‑driven architectures, keeps analytical stores in sync, and powers real‑time dashboards without littering business code with instrumentation.
The problem: keeping downstream systems consistent at scale
Traditional batch ETL jobs run every few hours or days. In a high‑traffic service that processes thousands of writes per second, that latency creates stale caches, delayed fraud detection, and duplicated effort when developers build custom sync scripts. The core challenge is two‑fold:
- Capture every change reliably while the source database continues to serve traffic.
- Propagate those changes to consumers (Kafka topics, search indexes, materialized views) with predictable latency and ordering guarantees.
Both dimensions have scalability implications. A naïve polling solution will hammer the primary, increase lock contention, and explode cost as traffic grows. An approach that reads the transaction log can stay near‑zero impact but must handle log‑format evolution and fault‑tolerant offset management.
Solution approaches
1. Log‑based CDC (the default for most production workloads)
| Aspect | Details |
|---|---|
| Source | Database write‑ahead log (WAL for PostgreSQL, binlog for MySQL, transaction log for SQL Server, oplog for MongoDB). |
| How it works | A connector tails the log, deserializes each record change event, and emits a structured payload (usually JSON or Avro) to a streaming platform. |
| Scalability | The log is already written once per transaction; the connector adds a lightweight read‑only stream. Horizontal scaling is achieved by partitioning the log (e.g., per table or shard) and running multiple connector instances. |
| Consistency model | Exactly‑once delivery is possible when the connector participates in the same transaction group as the downstream sink (e.g., Kafka’s transactional producer). Ordering is preserved per partition. |
| API pattern | A pull‑based API: the connector registers a consumer group with Kafka, pulls batches, and acknowledges offsets only after successful downstream writes. |
| Trade‑offs | No schema changes required, minimal impact on the primary. Requires access to the log (often privileged) and careful handling of log format upgrades. |
Why it scales – The log is append‑only, so reading it does not interfere with writes. As traffic grows, you can increase the number of connector tasks; each task processes a disjoint slice of the log, keeping CPU and network usage linear.
2. Trigger‑based CDC (when the log is unavailable or you need fine‑grained filtering)
| Aspect | Details |
|---|---|
| Source | Database triggers that fire on INSERT/UPDATE/DELETE and write a change row to a dedicated CDC table or publish to a message broker via NOTIFY. |
| How it works | Each DML statement executes the trigger code, which extracts the OLD and NEW values and inserts a compact representation into a staging table. A separate process reads that table and forwards events. |
| Scalability | Triggers execute in the same transaction as the original write, adding CPU and I/O per row. High write rates can cause contention on the CDC table. |
| Consistency model | Changes are atomic with the original transaction, guaranteeing that no partial updates are emitted. However, ordering across tables is not guaranteed unless you serialize at the application level. |
| API pattern | Push‑based: the trigger pushes directly to a broker (e.g., pg_notify) or writes to a table that a consumer polls. |
| Trade‑offs | Full control over which columns are captured; can enrich events with business logic. Overhead grows linearly with write volume, making it unsuitable for >10k writes/sec on a single node. |
When to choose it – Legacy databases that expose no log, or cases where you must filter out noisy columns (e.g., large JSON blobs) before they ever leave the DB.
3. Polling‑based CDC (the simplest but least efficient)
| Aspect | Details |
|---|---|
| Source | Periodic SELECT queries that compare a last_modified timestamp or a version column against a stored watermark. |
| How it works | A scheduler runs every N seconds, fetches rows where updated_at > watermark, emits events, and advances the watermark. |
| Scalability | Database load grows with the frequency of polls and the size of the tables scanned. Full table scans become prohibitive as data volume rises. |
| Consistency model | At‑least‑once: if a poll fails, the next run may re‑emit the same rows. No ordering guarantees beyond timestamp. |
| API pattern | Pull‑based batch fetch; often wrapped in a REST endpoint that downstream services call on demand. |
| Trade‑offs | Zero‑code changes to the source schema, works on any RDBMS. Latency is bounded by the poll interval; high‑frequency polls increase load dramatically. |
Typical use case – Low‑traffic reference data that changes a few times per hour, where eventual consistency is acceptable.
Wiring CDC into an event‑driven pipeline
- Connector → Streaming platform: Most teams use Apache Kafka as the backbone. Tools like Debezium provide ready‑made connectors for PostgreSQL, MySQL, MongoDB, SQL Server, and Oracle. The connector writes to a topic named
dbserver1.inventory.customerswith a schema that includesbefore,after, andopfields. - Schema registry: Register Avro schemas in Confluent Schema Registry so consumers can evolve the payload safely.
- Consumer services:
- Cache invalidation – A microservice reads change events and deletes the corresponding key in Redis.
- Search index sync – A worker transforms the
afterpayload into an Elasticsearch document. - Materialized view – A stream processing job (Kafka Streams, Flink, or ksqlDB) aggregates events to maintain a real‑time leaderboard.
- Exactly‑once semantics – Enable Kafka transactions on the producer side and configure the consumer group with
isolation.level=read_committed. This guarantees that a downstream database receives each change once, even if the connector restarts.
Trade‑off matrix
| Requirement | Log‑based | Trigger‑based | Polling |
|---|---|---|---|
| Latency | < 1 s (depends on broker) | < 1 s (trigger execution) | Poll interval (seconds‑to‑minutes) |
| Primary impact | Negligible (read‑only) | Additional CPU/I/O per write | Periodic scans, can be heavy |
| Operational complexity | Requires connector deployment, offset storage | Requires trigger maintenance, CDC table cleanup | Simple scheduler, but needs watermark handling |
| Schema changes | Transparent | May need trigger rewrite | No impact |
| Exactly‑once | Supported with Kafka transactions | Harder; need idempotent downstream logic | Not natively supported |
When you design a system, start by measuring write throughput and latency SLAs. If you need sub‑second propagation at >10k writes/sec, log‑based CDC is the only viable path. If you are locked into a managed database that hides the log (e.g., some cloud‑only MySQL offerings), a lightweight trigger that writes to a CDC table may be the next best option. Reserve polling for static reference tables that change rarely.
API patterns for CDC consumers
- Pull‑based batch API – Consumers request a range of offsets from Kafka (
consumer.poll()). This pattern fits batch jobs that need to replay a window of changes, such as back‑filling a data warehouse. - Push‑based webhook – Some teams expose an HTTP endpoint and configure the CDC connector (or a sidecar) to POST each event. Guarantees low latency but requires retry logic and idempotency handling on the receiver.
- Change‑feed query – Cloud providers like Google Cloud Spanner expose a
READ_CHANGE_STREAMAPI. Consumers open a streaming RPC and receive a continuous feed, blending the pull and push models.
Choosing the right pattern depends on the consumer’s processing model. Stateless microservices that react to each event benefit from push; analytics pipelines that need to checkpoint offsets prefer pull.
Looking ahead
The CDC ecosystem is converging on three trends:
- Unified change‑feed APIs – Databases such as Snowflake, Azure Synapse, and PostgreSQL 15 are exposing native change‑feed endpoints, reducing the need for external connectors.
- Serverless CDC – Managed services (e.g., AWS DMS, Azure Data Factory) now allow you to spin up a CDC pipeline without provisioning VMs, but they often hide configuration knobs that affect latency and ordering.
- Schema‑aware enrichment – By integrating a schema registry with CDC, you can evolve payloads safely and let downstream services apply version‑specific transformations.
For engineers building real‑time pipelines, the key is to align the CDC method with the consistency guarantees your business requires and the throughput your database can sustain. Start with a log‑based connector, validate end‑to‑end latency, and only fall back to triggers or polling when the platform constraints dictate.
If you need a hands‑on example, the Debezium tutorial on streaming PostgreSQL changes to Kafka walks through containerized setup, schema registration, and a simple consumer that updates an Elasticsearch index.
Comments
Please log in or register to join the discussion