Ditching the Batch

[Figure: CDC micro-batch into Apache Iceberg, architecture overview. Sources (PostgreSQL: WAL · pgoutput; MySQL / MariaDB: binlog · GTID; MongoDB: oplog · replica set; SQL Server: CDC change tables; Oracle / Db2: via AWS DMS) feed a micro-batch scheduler, either as direct CDC or via a DMS bridge. A cron trigger reads the log from the last checkpoint (state.json, idempotent and resumable) and writes Parquet plus delete vectors into the Apache Iceberg lakehouse: Iceberg table (v3) with Deletion Vectors and row lineage, AWS Glue Catalog, Amazon S3. Query engines: Athena · Spark · Trino · Redshift · SageMaker · QuickSight.]
Data Engineering · Apache Iceberg · CDC

Ditching the Batch: Real-Time CDC Ingestion into Iceberg

Replace nightly batch ETL with log-based change data capture into Apache Iceberg. The result: data freshness measured in minutes instead of hours, and a Glue bill that reflects what actually changed instead of everything that exists.

CloudZA Data Platform Team
Building highly scalable CDC-into-lakehouse systems that bring near-real-time streaming performance to micro-batch pipelines
Key Takeaways
  • Nightly batch re-reads ~100% of a table even when under 1% changed — billed at $0.44/DPU-hour on AWS Glue. CDC reads only the delta, cutting source bytes and Glue spend simultaneously.
  • Scheduled micro-batch CDC gives you streaming-grade correctness — every INSERT, UPDATE, and DELETE in order — with cron-grade simplicity. Freshness becomes a dial, not a rebuild.
  • Apache Iceberg v3, now available on AWS Glue and EMR, replaces the equality-delete bottleneck with Deletion Vectors and adds native row lineage — directly addressing the two biggest friction points of CDC micro-batch pipelines.

Nightly batch ETL is a tax paid in three currencies: compute cost, data staleness, and operational fragility. The alternative — reading the database transaction log directly and writing into an Apache Iceberg lakehouse — brings freshness from hours to single-digit minutes while eliminating the heavy, full-table scans that dominate most lakehouse bills.

This piece focuses on the core concepts and trade-offs, not implementation specifics. The same ideas apply whether you build the control plane yourself or use a managed tool.

The batch tax

Batch ETL's problem is that it re-derives state it already knew. When 0.5% of a 200M-row table changed overnight, a nightly full refresh still reads 100% of it — every night. On AWS Glue, that's DPU-hours billed for work that produces no new information ($0.44/DPU-hour standard, $0.29 on Flex). And poor data quality compounds the damage: organisations lose an average of $12.9M a year to it.

The waste shows up in three places:

  • Compute cost. Full or windowed scans read data that hasn't changed. DPU-hours pay for work that produces no new information.
  • Freshness. "Yesterday's data, available this morning" is the best case. Analysts and ML features run on data 12–24 hours stale. Fraud signals and operational dashboards are structurally impossible in this model.
  • Operational fragility. Big batch windows fail big. A scan timing out at row 180M leaves a half-loaded table, a blown SLA, and a manual re-run that collides with the next window.
[Figure: Source bytes read per run, batch vs CDC, for a 200M-row table with 0.5% daily churn. Nightly full refresh: ~100% of table. Log-based CDC (delta only): ~0.5% of table. Illustrative example · Glue ETL billed at $0.44/DPU-hr (AWS Glue Pricing, 2026).]
~200× fewer source bytes read after the initial snapshot. Source: AWS Glue Pricing (2026).
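
The ~200× figure follows directly from the example's numbers. A minimal sketch of that arithmetic, assuming uniform row sizes (so rows stand in for bytes) and at most one change per row per day; the function names are illustrative only:

```python
# Illustrative arithmetic only; table size and churn rate come from the example above.
TABLE_ROWS = 200_000_000
DAILY_CHURN = 0.005  # 0.5% of rows change per day


def rows_read_full_refresh(table_rows: int) -> int:
    """A full refresh scans every row, changed or not."""
    return table_rows


def rows_read_cdc(table_rows: int, churn: float) -> int:
    """Log-based CDC reads roughly one event per changed row (assumption)."""
    return round(table_rows * churn)


full = rows_read_full_refresh(TABLE_ROWS)
delta = rows_read_cdc(TABLE_ROWS, DAILY_CHURN)
reduction = full / delta

print(f"full refresh: {full:,} rows | CDC: {delta:,} rows | reduction: {reduction:.0f}x")
```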

What change data capture actually is

Every major operational database keeps a durable, ordered record of every committed write: the Write-Ahead Log in PostgreSQL, the binary log in MySQL, the oplog in MongoDB. These logs exist for recovery and to keep replicas in sync. CDC taps the same mechanism and turns each committed INSERT, UPDATE, and DELETE into a structured change event that can be consumed by a downstream system.

Two things most teams get wrong about CDC:

CDC captures deletes. Incremental-by-cursor does not.

A classic "incremental" pull (WHERE updated_at > :last_run) can never see a hard delete. The row is simply gone, with no updated_at to find it by. CDC reads the DELETE event straight from the log. If correctness matters — and for fraud signals, customer churn, or compliance reports it absolutely does — this distinction is the whole ballgame.
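
The difference is easy to see in a toy simulation. The sketch below uses in-memory dictionaries as stand-ins for the source table and two replicas; the table contents, timestamps, and the shape of the change log are all made up for illustration:

```python
from datetime import datetime

t0 = datetime(2024, 1, 1)            # rows were last touched before the previous run
t1 = datetime(2024, 1, 2)            # changes made after the previous run
last_run = datetime(2024, 1, 1, 12)

# Source table: primary key -> (value, updated_at)
source = {1: ("alice", t0), 2: ("bob", t0), 3: ("carol", t0)}

# Both replicas start as faithful copies of the source.
replica_cursor = {key: value for key, (value, _) in source.items()}
replica_cdc = dict(replica_cursor)

# After the last run: row 1 is updated, row 2 is hard-deleted.
source[1] = ("alice-v2", t1)
del source[2]
change_log = [("UPDATE", 1, "alice-v2"), ("DELETE", 2, None)]

# Cursor-based pull: WHERE updated_at > :last_run only sees rows that still exist.
for key, (value, updated_at) in source.items():
    if updated_at > last_run:
        replica_cursor[key] = value

# CDC: replay every logged event, including the DELETE.
for op, key, value in change_log:
    if op == "DELETE":
        replica_cdc.pop(key, None)
    else:
        replica_cdc[key] = value

current = {key: value for key, (value, _) in source.items()}
print("cursor replica still holds deleted row 2:", 2 in replica_cursor)
print("CDC replica matches source:", replica_cdc == current)
```

The cursor-based replica picks up the update but keeps the deleted row indefinitely; only the log replay converges on the source state.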

CDC doesn't have to be an always-on stream.

You can read the log in scheduled micro-batches: each run picks up from the last saved log position, drains the events that accumulated since, applies them, and checkpoints the new position. The result is streaming-grade correctness (every change, in order, including deletes) with batch-grade operational simplicity (cron, idempotent runs, easy retries). This is the most important design choice for analytics workloads.
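
A minimal sketch of one such run, assuming the log is a Python list, the log position is a list index, and the checkpoint lives in a local state.json file. A real pipeline would read WAL, binlog, or oplog positions and write to Iceberg; the helper names here are hypothetical:

```python
import json
import os
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> int:
    """Return the last saved log position, or 0 on the first run."""
    if not path.exists():
        return 0
    return json.loads(path.read_text())["position"]


def save_checkpoint(path: Path, position: int) -> None:
    """Write to a temp file, then rename, so a crash never leaves a torn checkpoint."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps({"position": position}))
    os.replace(tmp, path)


def run_micro_batch(log: list, target: dict, state_path: Path) -> int:
    """Drain events since the last checkpoint, apply them in order, then checkpoint."""
    start = load_checkpoint(state_path)
    events = log[start:]
    for op, key, value in events:
        if op == "DELETE":
            target.pop(key, None)   # keyed delete: safe to replay
        else:
            target[key] = value     # keyed upsert: safe to replay
    # Checkpoint only after the apply step. A crash before this line just
    # replays the same events on the next run, which the keyed writes tolerate.
    save_checkpoint(state_path, start + len(events))
    return len(events)


state = Path(tempfile.mkdtemp()) / "state.json"
log = [("INSERT", 1, "a"), ("UPDATE", 1, "b"), ("INSERT", 2, "c")]
target = {}

first = run_micro_batch(log, target, state)    # drains the three existing events
log.append(("DELETE", 2, None))
second = run_micro_batch(log, target, state)   # drains only the new event
print(first, second, target)
```

Saving the position after the apply step, combined with keyed upserts and deletes, is what makes a retried run harmless in this sketch.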

The Iceberg write model: equality deletes

Naive append-only lakes can't represent updates or deletes without duplicate rows and tombstone hacks, pushing reconciliation cost onto every reader or onto a nightly rewrite job. Apache Iceberg v2 solves this natively.

Its merge-on-read design records a change as a small delete file rather than rewriting the underlying data. The write is fast because no data files are read or rewritten. At query time, the engine merges delete files over the data files to produce the correct current view. That trade, very fast writes in exchange for slower reads, is what makes minute-level CDC into a lake possible at all.

Concretely, for a CDC event stream writing to an Iceberg table:

Source event | What gets written to Iceberg
INSERT | A new Parquet data file
UPDATE | An equality-delete file marking the old row by primary key, plus a data file with the new row
DELETE | An equality-delete file marking the row by primary key
Equality deletes are write-optimised, not read-optimised. If delete files accumulate — very common in high-frequency CDC — read amplification becomes severe, driving up query latency and cost. Compaction is mandatory for CDC workloads.
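
The mapping in the table can be sketched as a small planning function. This is a simplified model, not Iceberg's writer API: `ChangeEvent` and `plan_writes` are hypothetical names, and the sketch ignores details such as sequence numbers and multiple changes to the same key within one batch:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    op: str                      # "INSERT", "UPDATE" or "DELETE"
    key: int                     # primary key of the affected row
    row: Optional[dict] = None   # new row image; None for DELETE


def plan_writes(events):
    """Split a batch into rows for a new data file and keys for an equality-delete file."""
    data_rows, delete_keys = [], []
    for ev in events:
        if ev.op in ("UPDATE", "DELETE"):
            delete_keys.append(ev.key)   # mark the old row by primary key
        if ev.op in ("INSERT", "UPDATE"):
            data_rows.append(ev.row)     # the new row image goes to a data file
    return data_rows, delete_keys


batch = [
    ChangeEvent("INSERT", 1, {"id": 1, "name": "alice"}),
    ChangeEvent("UPDATE", 2, {"id": 2, "name": "bob-v2"}),
    ChangeEvent("DELETE", 3),
]
data_rows, delete_keys = plan_writes(batch)
print("data file rows:", data_rows)
print("equality-delete keys:", delete_keys)
```

Nothing in this step reads existing data files, which is why the write path stays cheap and the cost shifts to readers until compaction runs.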

Compaction rewrites small data files and delete files into larger, consolidated data files, resetting read performance to baseline. It runs externally — AWS Glue compaction, Spark rewrite_data_files, Apache Amoro, or a periodic Athena OPTIMIZE — on a schedule tuned to your update rate and query SLA.

Before going to production, verify each of these against your environment:

  • Tables without a primary key fall back to append-only. Upserts are keyed on the PK. A table without one can't resolve which existing row to update or delete.
  • Log retention must exceed your worst-case recovery window. PostgreSQL replication slots hold WAL until the consumer confirms; MySQL binlog retention is time-based; MongoDB oplog is size-based.
  • One log consumer per source stream. Give each pipeline its own replication slot, binlog position, or oplog resume point. Two jobs that share a slot or a checkpoint advance each other's position, so each sees only a partial event set.
  • Source database prerequisites vary by engine. PostgreSQL requires wal_level = logical. MySQL requires binlog_format = ROW, GTID enabled, and binlog_row_image = FULL. MongoDB requires replica-set mode.
  • Schema changes mid-stream require care. Adding a nullable column is generally safe. Renaming, dropping, or changing a column's type can break the pipeline mid-run.
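
The engine prerequisites in this checklist lend themselves to an automated preflight check. A minimal sketch, assuming the caller has already fetched the server settings into a dictionary (for example via SHOW VARIABLES on MySQL); `missing_prerequisites` is a hypothetical helper, and mapping "GTID enabled" to MySQL's `gtid_mode = ON` is an assumption that does not carry over to MariaDB:

```python
# Required settings per engine, taken from the checklist above.
REQUIRED_SETTINGS = {
    "postgresql": {"wal_level": "logical"},
    "mysql": {
        "binlog_format": "ROW",
        "gtid_mode": "ON",           # assumption: "GTID enabled" means gtid_mode = ON
        "binlog_row_image": "FULL",
    },
}


def missing_prerequisites(engine: str, settings: dict) -> list:
    """Return a human-readable list of settings that do not match the requirements."""
    problems = []
    for name, expected in REQUIRED_SETTINGS[engine].items():
        actual = settings.get(name)
        if actual is None or actual.upper() != expected.upper():
            problems.append(f"{name}: expected {expected}, found {actual}")
    return problems


print(missing_prerequisites("postgresql", {"wal_level": "replica"}))
print(missing_prerequisites(
    "mysql",
    {"binlog_format": "ROW", "gtid_mode": "ON", "binlog_row_image": "FULL"},
))
```

Running a check like this before the first snapshot surfaces configuration gaps while they are still cheap to fix.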

Streaming, micro-batch, or nightly batch: the strategic choice

[Figure: The ingestion mode spectrum, batch → micro-batch → streaming. Nightly batch (12–24 h stale): simple ops, no log access needed, misses deletes. Micro-batch CDC (5–60 min · idempotent; streaming correctness · cron simplicity): captures INS / UPD / DEL, checkpoint-based recovery, sweet spot for analytics; recommended for most analytics workloads. Always-on stream (seconds · 24/7 cluster cost): sub-second freshness, stateful cluster required, higher cost and complexity.]
The ingestion spectrum: micro-batch CDC sits at the sweet spot between operational simplicity and data freshness.
Ingestion pattern | Freshness | Operational model | Best fit
Nightly batch | 12–24 h | Cron, simple but fragile at scale | Low-churn, small tables; freshness not a requirement
Scheduled micro-batch CDC | 5–60 min (tunable) | Idempotent cron runs, checkpoint-based, easy retries | Most analytics platforms; operational dashboards; fraud signals
Always-on streaming CDC | Seconds–minutes | Persistent cluster, stateful offsets, 24/7 cost | Sub-minute SLAs; high-velocity event data

Why micro-batch is often the right call

  • Idempotent and checkpointed. Each run reads from the last confirmed log position, writes to Iceberg, and saves the new position. A run that dies halfway resumes cleanly from the last checkpoint.
  • Freshness is a dial, not a rebuild. Want 10-minute data? Set the cron interval to 10 minutes. Changing the freshness SLA is a configuration edit, not a re-architecture.
  • Compute is mostly idle, so it's cheap. An always-on streaming cluster runs 24/7. A micro-batch worker is idle between ticks — a fraction of the cost, especially against Glue's $0.44/DPU-hour baseline.

Financial math

The financial case turns on where DPU-hours go. AWS Glue bills standard Apache Spark ETL at $0.44/DPU-hour (Flex: $0.29, a 34% discount). CDC attacks both the source read cost and the nightly Glue rewrite line item simultaneously.

Metric | Nightly batch + Glue | CDC into Iceberg
Source bytes read / day | ~100% of every table | Delta only (often <1%)
Heavy Glue rewrite | Every night, full table | Replaced by periodic small-file compaction
Data freshness | 12–24 h | Minutes (tunable)
Captures deletes? | Only with full refresh | Yes, from the log
Failure blast radius | Whole nightly window | One checkpointed micro-batch
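
The pricing figures can be turned into a small cost model. The rates below are the ones quoted above; the job size (10 DPUs for 2 hours) is a purely hypothetical example, not a measurement, so substitute your own job metrics:

```python
STANDARD_RATE = 0.44   # USD per DPU-hour, as quoted above
FLEX_RATE = 0.29       # USD per DPU-hour on Flex, as quoted above


def glue_job_cost(dpus: float, hours: float, rate: float = STANDARD_RATE) -> float:
    """Cost of one Glue job run: DPUs x hours x rate."""
    return dpus * hours * rate


# Relative saving of Flex over standard pricing.
flex_discount = (STANDARD_RATE - FLEX_RATE) / STANDARD_RATE

# Hypothetical nightly full refresh: 10 DPUs for 2 hours.
nightly_standard = glue_job_cost(dpus=10, hours=2.0)
nightly_flex = glue_job_cost(dpus=10, hours=2.0, rate=FLEX_RATE)

print(f"Flex discount: {flex_discount:.0%}")
print(f"nightly run: ${nightly_standard:.2f} standard, ${nightly_flex:.2f} Flex")
```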

Iceberg v3: how it changes the game for CDC micro-batch

Every micro-batch pipeline built on Iceberg v2 carries one structural liability: the more frequently you sync, the faster equality-delete files pile up, and the slower your queries get. Iceberg v3, now available on AWS Glue ETL, Amazon EMR 7.12, and S3 Tables, removes that liability at the format level with two features built specifically for high-frequency, write-heavy workloads.

Deletion Vectors: micro-batch writes no longer fight query performance

In v2, every UPDATE or DELETE in a micro-batch run writes a new equality-delete file. Ten runs an hour on a hot table means ten new delete files per hour, and the query engine must join all of them against the data files on every read. Iceberg v3 replaces this with Deletion Vectors — a compact Roaring Bitmap stored in a Puffin sidecar file paired one-to-one with its data file. Delete state is co-located with data, not scattered across a growing pool of separate files.

  • High-frequency syncs stay cheap. Upserts no longer accumulate read amplification between compaction windows.
  • Compaction becomes maintenance, not survival. It's no longer racing against a degrading query performance curve.
  • Sub-10-minute cadences become practical. The performance ceiling that made aggressive micro-batch schedules risky on v2 is substantially raised.
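
A toy model makes the structural difference visible. In the sketch below a Python set of row positions stands in for the Roaring bitmap, and sets of keys stand in for equality-delete files; real engines store deletion vectors in Puffin files and work quite differently internally, so treat this only as an illustration of how delete state accumulates:

```python
# One data file with six rows; list index = row position within the file.
data_file = [{"id": i, "value": f"row-{i}"} for i in range(6)]
position_of = {row["id"]: pos for pos, row in enumerate(data_file)}

equality_delete_files = []   # v2 style: one new delete file per micro-batch
deletion_vector = set()      # v3 style: one set of deleted positions per data file

# Three micro-batches, each deleting some primary keys.
for deleted_keys in ([1], [3, 4], [5]):
    equality_delete_files.append(set(deleted_keys))
    deletion_vector |= {position_of[key] for key in deleted_keys}


def read_v2(rows, delete_files):
    """Every row is checked against every accumulated delete file."""
    return [r for r in rows if not any(r["id"] in d for d in delete_files)]


def read_v3(rows, vector):
    """Every row is checked against the single vector for its data file."""
    return [r for pos, r in enumerate(rows) if pos not in vector]


live_v2 = read_v2(data_file, equality_delete_files)
live_v3 = read_v3(data_file, deletion_vector)
print(len(equality_delete_files), "delete files vs 1 deletion vector")
print([r["id"] for r in live_v3])
```

Both reads return the same live rows, but the v2 reader's work grows with the number of batches while the v3 reader always consults one structure per data file.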

Row lineage: the lakehouse becomes the source of truth for change tracking

V3 eliminates the need for custom updated_at columns, separate audit tables, or full partition re-scans. Every row automatically carries two metadata fields: _row_id (a stable identifier across the row's entire lifetime) and _last_updated_sequence_number (the snapshot sequence number of the last modification). Downstream consumers can query directly for rows changed since a given checkpoint, with no custom tracking logic and no full re-scan.
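
The consumer-side pattern looks roughly like the sketch below. It models table rows as dictionaries carrying the two lineage fields named above; the sample rows and the `changed_since` helper are hypothetical, and in practice the filter would be a predicate in your query engine rather than Python code:

```python
# Sample rows with the two v3 lineage fields; the values are made up.
rows = [
    {"_row_id": 0, "_last_updated_sequence_number": 10, "name": "alice"},
    {"_row_id": 1, "_last_updated_sequence_number": 14, "name": "bob-v2"},
    {"_row_id": 2, "_last_updated_sequence_number": 12, "name": "carol"},
]


def changed_since(table_rows, checkpoint_seq: int):
    """Rows whose last modification is newer than the consumer's checkpoint."""
    return [
        r for r in table_rows
        if r["_last_updated_sequence_number"] > checkpoint_seq
    ]


# A downstream consumer that last synced at sequence number 11.
delta = changed_since(rows, checkpoint_seq=11)
next_checkpoint = max(r["_last_updated_sequence_number"] for r in rows)
print([r["_row_id"] for r in delta], "next checkpoint:", next_checkpoint)
```

Because _row_id stays stable across updates, the consumer can upsert the delta into its own store keyed on that field.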

Upgrading existing tables

V3 is backward compatible: existing v2 tables upgrade atomically without rewriting data files. The upgrade is one-way — v3 cannot be downgraded — so validate in a non-production environment first. All AWS analytics engines (Glue, EMR, Athena, SageMaker) support both v2 and v3 simultaneously.

Should you ditch the batch?

Reach for CDC-into-Iceberg when:

  • Your sources are PostgreSQL, MySQL, MongoDB, or SQL Server with logical replication / binlog / oplog available — or any source reachable via DMS.
  • You need deletes captured correctly, or freshness measured in minutes, not hours.
  • Your Glue bill is dominated by full-table rewrites of mostly-unchanged data.

Stay on batch for now when:

  • Sources don't expose a log (some managed SaaS APIs, file drops).
  • Daily freshness is genuinely sufficient and volumes are small.
  • Your team can't yet commit to running compaction as an ongoing operational concern.

How ETL Engine implements this

ETL Engine is CloudZA's managed CDC-into-lakehouse control plane. It puts the concepts above into production across any AWS region without requiring teams to wire up and operate the individual moving parts.

Supported sources

Source | Direct log CDC | Via DMS → S3/Kafka
PostgreSQL | ✓ WAL / pgoutput | -
MySQL / MariaDB | ✓ binlog / GTID | -
MongoDB | ✓ oplog | -
SQL Server | ✓ CDC change tables | -
Oracle | Roadmap | ✓ via DMS
IBM Db2 | - | ✓ via DMS
Amazon S3 / files | ✓ file ingestion (snapshot + incremental) | -
Apache Kafka | ✓ streaming topic ingestion (DMS output or direct) | -

Frequently asked questions

Is change data capture the same as a real-time stream?
No. CDC defines what you read (the database transaction log, including deletes), not how often. You can run CDC as an always-on stream or as scheduled micro-batches that drain the log from the last checkpoint. Micro-batches give streaming-grade correctness with cron-grade simplicity.
Why is compaction mandatory for CDC into Iceberg?
In Iceberg v2, updates and deletes are recorded as equality-delete files. Fast to write, but accumulated delete files cause severe read amplification over time. Scheduled compaction rewrites those files into consolidated data files, resetting query performance to baseline. Iceberg v3 reduces this pressure significantly through Deletion Vectors — but compaction remains good practice regardless of format version.
How much does CDC reduce AWS Glue costs?
It eliminates the largest line item: the nightly full-table rewrite. After the initial snapshot, CDC reads only the delta (often under 1% of a table) and writes columnar Parquet directly — avoiding the two-pass Glue DPU cost of landing-then-rewriting.
Does CDC capture hard deletes?
Yes, and this is its defining advantage over cursor-based incremental pulls. A WHERE updated_at > :last_run query can never see a deleted row because it's gone from the table. CDC reads the DELETE event directly from the log, keeping the lakehouse consistent with the source system.
What happens when a source table has no primary key?
Iceberg equality deletes work by matching rows against a key. Without a primary key, there is no stable identifier to match against, so the CDC pipeline cannot resolve which existing row an update or delete refers to. The result is append-only behaviour: every change arrives as a new row rather than a correction, and duplicates accumulate.
Is your Glue bill mostly re-reading data that didn't change?

We build and operate CDC-into-lakehouse pipelines across multi-region AWS environments, with data-residency controls, runtime credential isolation, and built-in compaction. Let's talk about a CDC pilot.
