- Nightly batch re-reads ~100% of a table even when under 1% changed — billed at $0.44/DPU-hour on AWS Glue. CDC reads only the delta, cutting source bytes and Glue spend simultaneously.
- Scheduled micro-batch CDC gives you streaming-grade correctness — every INSERT, UPDATE, and DELETE in order — with cron-grade simplicity. Freshness becomes a dial, not a rebuild.
- Apache Iceberg v3, now available on AWS Glue and EMR, replaces the equality-delete bottleneck with Deletion Vectors and adds native row lineage — directly addressing the two biggest friction points of CDC micro-batch pipelines.
Nightly batch ETL is a tax paid in three currencies: compute cost, data staleness, and operational fragility. The alternative — reading the database transaction log directly and writing into an Apache Iceberg lakehouse — brings freshness from hours to single-digit minutes while eliminating the heavy, full-table scans that dominate most lakehouse bills.
This piece focuses on the core concepts and trade-offs, not implementation specifics. The same ideas apply whether you build the control plane yourself or use a managed tool.
The batch tax
Batch ETL's problem is that it re-derives state it already knew. When 0.5% of a 200M-row table changed overnight, a nightly full refresh still reads 100% of it — every night. On AWS Glue, that's DPU-hours billed for work that produces no new information ($0.44/DPU-hour standard, $0.29 on Flex). And poor data quality compounds the damage: organisations lose an average of $12.9M a year to it.
The waste shows up in three places:
- Compute cost. Full or windowed scans read data that hasn't changed. DPU-hours pay for work that produces no new information.
- Freshness. "Yesterday's data, available this morning" is the best case. Analysts and ML features run on data 12–24 hours stale. Fraud signals and operational dashboards are structurally impossible in this model.
- Operational fragility. Big batch windows fail big. A scan timing out at row 180M leaves a half-loaded table, a blown SLA, and a manual re-run that collides with the next window.
What change data capture actually is
Every major operational database keeps a durable, ordered record of every committed write: the Write-Ahead Log in PostgreSQL, the binary log in MySQL, the oplog in MongoDB. These logs exist for crash recovery and so replicas can stay in sync. CDC taps the same mechanism and turns each committed INSERT, UPDATE, and DELETE into a structured change event that a downstream system can consume.
Two things most teams get wrong about CDC:
CDC captures deletes. Incremental-by-cursor does not.
A classic "incremental" pull (WHERE updated_at > :last_run) can never see a hard delete. The row is simply gone, with no updated_at to find it by. CDC reads the DELETE event straight from the log. If correctness matters — and for fraud signals, customer churn, or compliance reports it absolutely does — this distinction is the whole ballgame.
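The difference can be reproduced in a few lines. The sketch below is illustrative only: it uses an in-memory SQLite table, and the `customers` table, its rows, and the hand-written `change_log` list are all hypothetical stand-ins for a real source database and its transaction log.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 100), (3, "Edsger", 100)],
)

last_run = 100  # cursor value saved by the previous incremental pull

# Two writes happen after the last run: one update and one hard delete.
conn.execute("UPDATE customers SET name = 'Ada L.', updated_at = 200 WHERE id = 1")
conn.execute("DELETE FROM customers WHERE id = 2")

# Hypothetical stand-in for what a log reader would emit for the same writes.
change_log = [
    {"op": "UPDATE", "id": 1, "name": "Ada L."},
    {"op": "DELETE", "id": 2},
]

# The cursor query only sees rows that still exist and carry a newer timestamp.
cursor_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM customers WHERE updated_at > ?", (last_run,)
    )
]

# The change log carries the delete as an explicit event.
cdc_deleted_ids = [e["id"] for e in change_log if e["op"] == "DELETE"]
```

Here `cursor_ids` contains only the updated row, while `cdc_deleted_ids` names the row that disappeared. A target loaded from the cursor query alone would keep customer 2 indefinitely.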
CDC doesn't have to be an always-on stream.
You can read the log in scheduled micro-batches: each run picks up from the last saved log position, drains the events that accumulated since, applies them, and checkpoints the new position. The result is streaming-grade correctness (every change, in order, including deletes) with batch-grade operational simplicity (cron, idempotent runs, easy retries). This is the most important design choice for analytics workloads.
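The read, apply, checkpoint cycle described above can be sketched as follows. This is a minimal in-memory model, not a production reader: the `Checkpoint` class, the `run_micro_batch` function, and the list-based log are hypothetical names introduced for illustration, and a real pipeline would persist the position durably and write to Iceberg rather than a dict.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    position: int = 0  # index of the next unread log entry


def run_micro_batch(log, target, checkpoint):
    """Drain events since the checkpoint, apply them in order, then advance it."""
    pending = log[checkpoint.position:]
    for event in pending:
        if event["op"] == "DELETE":
            target.pop(event["id"], None)
        else:  # INSERT and UPDATE are both applied as an upsert by primary key
            target[event["id"]] = event["row"]
    # The position moves only after every pending event has been applied.
    checkpoint.position += len(pending)
    return len(pending)


log = [
    {"op": "INSERT", "id": 1, "row": {"name": "Ada"}},
    {"op": "INSERT", "id": 2, "row": {"name": "Grace"}},
]
target = {}
cp = Checkpoint()
first = run_micro_batch(log, target, cp)

log += [
    {"op": "UPDATE", "id": 1, "row": {"name": "Ada L."}},
    {"op": "DELETE", "id": 2},
]
second = run_micro_batch(log, target, cp)
third = run_micro_batch(log, target, cp)  # nothing new accumulated

# Replaying the whole log from position 0 converges on the same state.
replayed = {}
run_micro_batch(log, replayed, Checkpoint())
```

Because upserts and deletes are keyed on the primary key, a run that crashes before saving its position can be replayed from the old checkpoint without corrupting the target, which is what makes retries cheap.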
The Iceberg write model: equality deletes
Naive append-only lakes can't represent updates or deletes without duplicate rows and tombstone hacks, pushing reconciliation cost onto every reader or onto a nightly rewrite job. Apache Iceberg v2 solves this natively.
Its merge-on-read design records a change as a small delete file rather than rewriting the underlying data. The write is fast because no data files are read or rewritten. At query time, the engine merges delete files over the data files to produce the correct current view. That trade (fast writes, slower reads) is what makes minute-level CDC into a lake possible at all.
Concretely, for a CDC event stream writing to an Iceberg table:
| Source event | What gets written to Iceberg |
|---|---|
| INSERT | A new Parquet data file |
| UPDATE | An equality-delete file marking the old row by primary key, plus a data file with the new row |
| DELETE | An equality-delete file marking the row by primary key |
Compaction rewrites small data files and delete files into larger, consolidated data files, resetting read performance to baseline. It runs externally — AWS Glue compaction, Spark rewrite_data_files, Apache Amoro, or a periodic Athena OPTIMIZE — on a schedule tuned to your update rate and query SLA.
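To make the read-side cost concrete, here is a simplified model of merge-on-read with equality deletes. It assumes the Iceberg rule that an equality delete applies only to data files with a strictly lower sequence number; the tuple layout and the `read_table` and `compact` functions are hypothetical and ignore partitioning, Parquet, and metadata entirely.

```python
def read_table(data_files, equality_deletes):
    """Merge delete files over data files to produce the current view.

    data_files:        list of (sequence_number, [row dicts])
    equality_deletes:  list of (sequence_number, {primary keys})
    """
    visible = {}
    for seq, rows in sorted(data_files, key=lambda f: f[0]):
        for row in rows:
            # A delete only hides rows written in strictly earlier commits.
            deleted = any(
                row["id"] in keys and seq < delete_seq
                for delete_seq, keys in equality_deletes
            )
            if not deleted:
                visible[row["id"]] = row
    return visible


def compact(data_files, equality_deletes):
    """Rewrite everything into one data file with no delete files left over."""
    visible = read_table(data_files, equality_deletes)
    newest = max(seq for seq, _ in data_files)
    return [(newest, list(visible.values()))], []


# Commit 1: two INSERTs. Commit 2: UPDATE of id 1. Commit 3: DELETE of id 2.
data_files = [
    (1, [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]),
    (2, [{"id": 1, "name": "Ada L."}]),
]
equality_deletes = [(2, {1}), (3, {2})]

current = read_table(data_files, equality_deletes)
compacted_files, compacted_deletes = compact(data_files, equality_deletes)
after_compaction = read_table(compacted_files, compacted_deletes)
```

Every read in this model checks each row against every delete file, so the work grows with the number of delete files. After `compact`, the same view comes from a single data file with nothing to merge.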
Before going to production, verify each of these against your environment:
- Tables without a primary key fall back to append-only. Upserts are keyed on the PK. A table without one can't resolve which existing row to update or delete.
- Log retention must exceed your worst-case recovery window. PostgreSQL replication slots hold WAL until the consumer confirms; MySQL binlog retention is time-based; MongoDB oplog is size-based.
- One log consumer per source stream. Replication slots, binlog consumers, and oplog readers deliver events once. Two jobs reading the same stream will each see only a partial event set.
- Source database prerequisites vary by engine. PostgreSQL requires wal_level = logical. MySQL requires binlog_format = ROW, GTID enabled, and binlog_row_image = FULL. MongoDB requires replica-set mode.
- Schema changes mid-stream require care. Adding a nullable column is generally safe. Renaming, dropping, or changing a column's type can break the pipeline mid-run.
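The engine prerequisites in the checklist lend themselves to an automated pre-flight check. The sketch below is one possible shape, under assumptions: the `REQUIRED` table and `missing_prerequisites` function are hypothetical, the settings dict is assumed to have been fetched separately (for example with SHOW on PostgreSQL or SHOW VARIABLES on MySQL), and "GTID enabled" is mapped to MySQL's `gtid_mode = ON`. MongoDB's replica-set requirement is not a single setting, so it is left out.

```python
# Required values, taken from the checklist above.
REQUIRED = {
    "postgresql": {"wal_level": "logical"},
    "mysql": {
        "binlog_format": "ROW",
        "binlog_row_image": "FULL",
        "gtid_mode": "ON",  # assumption: how "GTID enabled" shows up in MySQL
    },
}


def missing_prerequisites(engine, settings):
    """Return {setting: required value} for every setting that is wrong or absent."""
    return {
        name: wanted
        for name, wanted in REQUIRED[engine].items()
        if str(settings.get(name, "")).upper() != wanted.upper()
    }


pg_problems = missing_prerequisites("postgresql", {"wal_level": "replica"})
mysql_problems = missing_prerequisites(
    "mysql",
    {"binlog_format": "ROW", "binlog_row_image": "full", "gtid_mode": "ON"},
)
```

An empty result means the listed settings match; anything else names what to change before the first sync.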
Streaming, micro-batch, or nightly batch: the strategic choice
| Ingestion pattern | Freshness | Operational model | Best fit |
|---|---|---|---|
| Nightly batch | 12–24 h | Cron, simple but fragile at scale | Low-churn, small tables; freshness not a requirement |
| Scheduled micro-batch CDC | 5–60 min (tunable) | Idempotent cron runs, checkpoint-based, easy retries | Most analytics platforms; operational dashboards; fraud signals |
| Always-on streaming CDC | Seconds–minutes | Persistent cluster, stateful offsets, 24/7 cost | Sub-minute SLAs; high-velocity event data |
Why micro-batch is often the right call
- Idempotent and checkpointed. Each run reads from the last confirmed log position, writes to Iceberg, and saves the new position. A run that dies halfway resumes cleanly from the last checkpoint.
- Freshness is a dial, not a rebuild. Want 10-minute data? Set the cron interval to 10 minutes. Changing the freshness SLA is a configuration edit, not a re-architecture.
- Compute is mostly idle, so it's cheap. An always-on streaming cluster runs 24/7. A micro-batch worker is idle between ticks — a fraction of the cost, especially against Glue's $0.44/DPU-hour baseline.
Financial math
The financial case turns on where DPU-hours go. AWS Glue bills standard Apache Spark ETL at $0.44/DPU-hour (Flex: $0.29, a 34% discount). CDC attacks both the source read cost and the nightly Glue rewrite line item simultaneously.
| | Nightly batch + Glue | CDC into Iceberg |
|---|---|---|
| Source bytes read / day | ~100% of every table | Delta only (often <1%) |
| Heavy Glue rewrite | Every night, full table | Replaced by periodic small-file compaction |
| Data freshness | 12–24 h | Minutes (tunable) |
| Captures deletes? | Only with full refresh | Yes, from the log |
| Failure blast radius | Whole nightly window | One checkpointed micro-batch |
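A back-of-the-envelope comparison shows how the line items move. Only the two Glue rates and the 200M-row, 0.5% example come from the text; the DPU counts, run durations, and run frequency below are hypothetical placeholders to be replaced with your own job metrics, and compaction cost is deliberately left out.

```python
GLUE_STANDARD_RATE = 0.44  # USD per DPU-hour, from the text
GLUE_FLEX_RATE = 0.29      # USD per DPU-hour, from the text


def daily_cost(dpus, hours_per_run, runs_per_day, rate=GLUE_STANDARD_RATE):
    """Daily spend = DPUs x hours per run x runs per day x rate."""
    return dpus * hours_per_run * runs_per_day * rate


# Rows touched per day in the 200M-row example with 0.5% churn (integer math).
total_rows = 200_000_000
changed_rows = total_rows * 5 // 1000

# Hypothetical nightly full refresh: 20 DPUs for 2 hours, once a day.
nightly = daily_cost(dpus=20, hours_per_run=2.0, runs_per_day=1)

# Hypothetical micro-batch: 2 DPUs for 3 minutes, every 30 minutes.
micro = daily_cost(dpus=2, hours_per_run=0.05, runs_per_day=48)

# Flex discount relative to the standard rate.
flex_discount = 1 - GLUE_FLEX_RATE / GLUE_STANDARD_RATE
```

With these placeholder inputs the nightly job costs roughly $17.60 a day against roughly $2.11 for the micro-batch schedule, before adding whatever periodic compaction costs in your environment.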
Iceberg v3: how it changes the game for CDC micro-batch
Every micro-batch pipeline built on Iceberg v2 carries one structural liability: the more frequently you sync, the faster equality-delete files pile up, and the slower your queries get. Iceberg v3, now available on AWS Glue ETL, Amazon EMR 7.12, and S3 Tables, removes that liability at the format level with two features built specifically for high-frequency, write-heavy workloads.
Deletion Vectors: micro-batch writes no longer fight query performance
In v2, every UPDATE or DELETE in a micro-batch run writes a new equality-delete file. Ten runs an hour on a hot table means ten new delete files per hour, and the query engine must join all of them against the data files on every read. Iceberg v3 replaces this with Deletion Vectors — a compact Roaring Bitmap stored in a Puffin sidecar file paired one-to-one with its data file. Delete state is co-located with data, not scattered across a growing pool of separate files.
- High-frequency syncs stay cheap. Upserts no longer accumulate write amplification between compaction windows.
- Compaction becomes maintenance, not survival. It's no longer racing against a degrading query performance curve.
- Sub-10-minute cadences become practical. The performance ceiling that made aggressive micro-batch schedules risky on v2 is substantially raised.
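The structural difference from equality deletes can be modelled in a few lines. In this sketch a plain Python integer stands in for the Roaring Bitmap, and the `DeletionVector` class and `scan` function are hypothetical; the point is only that repeated micro-batches update one position bitmap per data file instead of adding new delete files.

```python
class DeletionVector:
    """One bitmap of deleted row positions, paired with a single data file."""

    def __init__(self):
        self.bits = 0  # a Python int used as a stand-in for a Roaring Bitmap

    def delete(self, position):
        self.bits |= 1 << position

    def is_deleted(self, position):
        return bool((self.bits >> position) & 1)


def scan(rows, vector):
    """Read a data file, skipping positions flagged in its deletion vector."""
    return [row for pos, row in enumerate(rows) if not vector.is_deleted(pos)]


data_file = [f"row-{i}" for i in range(20)]
vectors = {"data-file-0": DeletionVector()}

# Ten micro-batch runs, each deleting one row from the same data file.
for run in range(10):
    vectors["data-file-0"].delete(run * 2)

live_rows = scan(data_file, vectors["data-file-0"])
```

After ten runs there is still exactly one vector for the data file, and the read is a position lookup rather than a key comparison against ten separate delete files.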
Row lineage: the lakehouse becomes the source of truth for change tracking
V3 eliminates the need for custom updated_at columns, separate audit tables, or full partition re-scans. Every row automatically carries two metadata fields: _row_id (a stable identifier across the row's entire lifetime) and _last_updated_sequence_number (the snapshot sequence number of the last modification). Downstream consumers can query directly for rows changed since a given checkpoint, with no custom tracking logic and no full re-scan.
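A downstream consumer's "changed since checkpoint" query reduces to a filter on the lineage field. The sketch below models that with a list of dicts; the two field names come from the text, while the sample rows and the `changed_since` function are hypothetical. In a query engine the same idea would be a WHERE clause on `_last_updated_sequence_number`.

```python
rows = [
    {"_row_id": 0, "_last_updated_sequence_number": 3, "name": "Ada"},
    {"_row_id": 1, "_last_updated_sequence_number": 7, "name": "Grace"},
    {"_row_id": 2, "_last_updated_sequence_number": 9, "name": "Edsger"},
]


def changed_since(table_rows, checkpoint_sequence):
    """Return rows whose last modification is newer than the saved checkpoint."""
    return [
        row
        for row in table_rows
        if row["_last_updated_sequence_number"] > checkpoint_sequence
    ]


changed = changed_since(rows, checkpoint_sequence=5)
changed_ids = [row["_row_id"] for row in changed]

# The consumer then saves the highest sequence number it has processed.
next_checkpoint = max(row["_last_updated_sequence_number"] for row in changed)
```

One caveat worth verifying for your use case: a filter like this runs over rows that currently exist, so rows deleted since the checkpoint would likely need to be detected another way.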
Upgrading existing tables
V3 is backward compatible: existing v2 tables upgrade atomically without rewriting data files. The upgrade is one-way — v3 cannot be downgraded — so validate in a non-production environment first. All AWS analytics engines (Glue, EMR, Athena, SageMaker) support both v2 and v3 simultaneously.
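In Spark-based engines the upgrade is typically a table-property change. The helper below only builds the SQL text; it assumes Iceberg's `format-version` table property, and both the `upgrade_statement` function and the `glue_catalog.sales.orders` table name are hypothetical. Running the statement requires a Spark session already configured with an Iceberg catalog, which is not shown here.

```python
def upgrade_statement(table, target_version=3):
    """Build the ALTER TABLE statement that raises an Iceberg table's format version."""
    if target_version not in (2, 3):
        raise ValueError("this sketch only handles upgrades to v2 or v3")
    return (
        f"ALTER TABLE {table} "
        f"SET TBLPROPERTIES ('format-version' = '{target_version}')"
    )


# Hypothetical table name; run against a non-production copy first,
# since the upgrade cannot be reversed.
sql = upgrade_statement("glue_catalog.sales.orders")
# With a configured SparkSession this would be executed as: spark.sql(sql)
```

Keeping the statement in a reviewed script rather than typing it ad hoc makes it easier to rehearse the one-way change in a test environment before touching production tables.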
Should you ditch the batch?
Reach for CDC-into-Iceberg when:
- Your sources are PostgreSQL, MySQL, MongoDB, or SQL Server with logical replication / binlog / oplog available — or any source reachable via DMS.
- You need deletes captured correctly, or freshness measured in minutes, not hours.
- Your Glue bill is dominated by full-table rewrites of mostly-unchanged data.
Stay on batch for now when:
- Sources don't expose a log (some managed SaaS APIs, file drops).
- Daily freshness is genuinely sufficient and volumes are small.
- Your team can't yet commit to running compaction as an ongoing operational concern.
How ETL Engine implements this
ETL Engine is CloudZA's managed CDC-into-lakehouse control plane. It puts the concepts above into production across any AWS region without requiring teams to wire up and operate the individual moving parts.
Supported sources
| Source | Direct log CDC | Via DMS → S3/Kafka |
|---|---|---|
| PostgreSQL | ✓ WAL / pgoutput | ✓ |
| MySQL / MariaDB | ✓ binlog / GTID | ✓ |
| MongoDB | ✓ oplog | ✓ |
| SQL Server | ✓ CDC change tables | ✓ |
| Oracle | Roadmap | ✓ via DMS |
| IBM Db2 | — | ✓ via DMS |
| Amazon S3 / files | ✓ file ingestion (snapshot + incremental) | |
| Apache Kafka | ✓ streaming topic ingestion (DMS output or direct) | |
Frequently asked questions
A WHERE updated_at > :last_run query can never see a deleted row because it's gone from the table. CDC reads the DELETE event directly from the log, keeping the lakehouse consistent with the source system.
We build and operate CDC-into-lakehouse pipelines across multi-region AWS environments, with data-residency controls, runtime credential isolation, and built-in compaction. Let's talk about a CDC pilot.