
Apache Iceberg v3: What's New for Spark Users

Iceberg v3 is the first format-version bump since 2021 and finally lands the features Spark Scala teams have been working around for years: deletion vectors that replace position-delete files, mandatory row lineage for cheap CDC, a VARIANT type shared with Spark and Delta, default column values, geospatial types, and nanosecond timestamps. Here's what each feature actually does, which pieces are production-ready on Spark 4.0 today, and what to watch out for if you flip a table to format-version = 3.

For broader context, see Apache Iceberg vs Delta Lake: Choosing a Table Format and The State of Spark Scala in 2026.

The Headline

The Iceberg v3 spec was approved by the Apache Iceberg community in late 2025 and rolled out across engines through early 2026. As of May 2026:

  • Apache Spark 4.0 + Iceberg 1.10.x has the most complete open-source v3 support — most production deployments using v3 today are on this combination.
  • AWS announced v3 deletion vectors and row lineage across EMR 7.12, Glue, SageMaker notebooks, S3 Tables, and the Glue Data Catalog in November 2025.
  • Snowflake marked v3 generally available on May 7, 2026.
  • Databricks moved v3 to Public Preview on April 24, 2026, with a subset of features supported on Runtime 18.0+.
  • Open-source Trino has not yet adopted v3; only the Starburst Galaxy fork does. PyIceberg can read v3 but writes are limited.

The practical implication: if your stack is Spark plus AWS, plus optionally Snowflake or Databricks, you can use v3 in production today. If you need open-source Trino or Athena to read the same tables, give it another quarter or two.

The full picture is laid out in the official Iceberg specification and Google's v3 overview.


Creating a v3 Table

The simplest way in. Either set the format version at create time or upgrade an existing v2 table in place.

// build.sbt: Iceberg runtime for Spark 4.0 on Scala 2.13 (version 1.10+)
libraryDependencies ++= Seq(
  "org.apache.iceberg" % "iceberg-spark-runtime-4.0_2.13" % "1.10.0"
)

// Create a v3 table directly
spark.sql("""
  CREATE TABLE my_catalog.warehouse.events (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id STRING,
    payload STRING
  ) USING iceberg
  TBLPROPERTIES ('format-version' = '3')
""")

// Or upgrade an existing v2 table — metadata-only operation
spark.sql("""
  ALTER TABLE my_catalog.warehouse.events
  SET TBLPROPERTIES ('format-version' = '3')
""")

The upgrade is a single metadata write — no data is rewritten. Existing data files keep their format; new writes use v3 conventions. Row lineage fields are populated from the next write onward; for existing rows, _row_id is null until they're rewritten by a compaction.
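
The snippets above assume a catalog named my_catalog is already registered with the Spark session. A minimal sketch of one way to do that, assuming a Hadoop-type catalog and a local warehouse path; both are placeholders chosen for illustration, and production setups more commonly point at Glue, a REST catalog, or Hive:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical local setup: the catalog type and warehouse path are placeholders.
val spark = SparkSession.builder()
  .appName("iceberg-v3-demo")
  .master("local[*]")
  // Iceberg's SQL extensions (stored procedures, extra DDL)
  .config("spark.sql.extensions",
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.type", "hadoop")
  .config("spark.sql.catalog.my_catalog.warehouse", "/tmp/iceberg-warehouse")
  .getOrCreate()

// After the ALTER TABLE, check which format version the table reports.
// Iceberg's Spark integration normally surfaces format-version as a table property.
spark.sql("SHOW TBLPROPERTIES my_catalog.warehouse.events ('format-version')").show()
```

Checking the reported property right after the upgrade is a cheap way to confirm the metadata commit landed before any v3-only writes happen.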


Deletion Vectors: The Headline Feature

This is the feature that will land first on most teams' radar because it directly fixes the operational pain of v2 row-level deletes.

The v2 problem. In v2, a DELETE or UPDATE writes a position delete file: a small file listing (file_path, position) pairs of rows to skip. Every read has to find the delete files that apply to each data file and mask those rows during the scan. As deletes accumulate, so do small position-delete files. Compaction is essentially mandatory, and CDC pipelines that issue DML constantly become a small-files nightmare. The small-files problem Spark teams know from the data layer reappears on the metadata side.

The v3 fix. Each data file now gets a paired deletion vector — a Puffin sidecar file containing a Roaring bitmap marking which rows are deleted. There is at most one deletion vector per data file. Readers consult the local bitmap during the scan of file_A.parquet — they don't need to globally join against a delete-file index. This dramatically simplifies the read path and shrinks the metadata footprint.

The binary format is the same one Delta Lake uses for deletion vectors, which is part of the larger format-convergence story: Iceberg and Delta share the same on-disk DV encoding, so an engine that can decode one can reuse that decoder for the other.

// With format-version = 3, DELETE / UPDATE / MERGE automatically use deletion vectors.
// Nothing changes in your Scala code.

spark.sql("""
  MERGE INTO my_catalog.warehouse.events t
  USING staging_events s
  ON t.event_id = s.event_id
  WHEN MATCHED THEN UPDATE SET payload = s.payload
  WHEN NOT MATCHED THEN INSERT *
""")

// Under v2 this would have produced position-delete Parquet files in data/.
// Under v3 it writes deletion vectors as blobs in Puffin (.puffin) files:
// one Roaring bitmap of deleted row positions per affected data file.

What this means in practice.

  • Read amplification on heavily-DML'd tables drops significantly. Whether you see 2x or 10x depends on workload shape; teams running CDC into wide tables tend to see the largest gains.
  • Compaction still matters, but the cost of skipping it is lower. v2 made compaction a critical hygiene job; v3 makes it more of a tuning lever.
  • A single data file can have at most one deletion vector. Because files are immutable, repeated deletes against the same data file are merged into a new vector that supersedes the old one, rather than stacking up another delete file each time.
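
To make the read path and the one-vector-per-file rule concrete, here is a toy in-memory model. It is a sketch under loud assumptions: scala.collection.immutable.BitSet stands in for the Roaring bitmap, a Map stands in for the Puffin-backed lookup, and DataFile and TableState are hypothetical names, not Iceberg classes.

```scala
import scala.collection.immutable.BitSet

// Toy model only: BitSet stands in for the Roaring bitmap, and a Map stands in
// for the lookup from data file path to its deletion vector.
final case class DataFile(path: String, rows: Vector[String])

final case class TableState(files: Vector[DataFile], dvs: Map[String, BitSet]) {

  // A new delete merges into the file's existing vector, so each file still
  // has at most one deletion vector afterwards.
  def delete(path: String, positions: Set[Int]): TableState = {
    val merged: BitSet = dvs.getOrElse(path, BitSet.empty) ++ positions
    copy(dvs = dvs.updated(path, merged))
  }

  // The scan consults only the vector paired with the file being read.
  def scan: Vector[String] =
    files.flatMap { f =>
      val deleted = dvs.getOrElse(f.path, BitSet.empty)
      f.rows.zipWithIndex.collect { case (row, pos) if !deleted(pos) => row }
    }
}

val t0 = TableState(
  Vector(
    DataFile("file_A.parquet", Vector("a0", "a1", "a2", "a3")),
    DataFile("file_B.parquet", Vector("b0", "b1"))),
  Map.empty)

val t1 = t0.delete("file_A.parquet", Set(1))
val t2 = t1.delete("file_A.parquet", Set(3)).delete("file_B.parquet", Set(0))

println(t2.scan)     // Vector(a0, a2, b1)
println(t2.dvs.size) // 2: one vector per affected file, not one per DELETE
```

Three deletes touched two files and left two vectors behind. In the v2 model the same three operations would typically leave three separate position-delete files for readers to reconcile.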

This is the feature that's furthest along across engines and the one most teams should evaluate first.


Row Lineage: Built In, Not Opt-In

Row lineage is mandatory in v3 — there is no flag to turn it off. Every row in a v3 table carries two implicit fields:

  • _row_id: a long that uniquely identifies the row across its entire lifetime in the table. The ID survives updates, compactions, and partition changes. Only a freshly inserted row gets a new ID.
  • _last_updated_sequence_number: the table-level sequence number of the snapshot that last modified the row.

Together these give you something Spark Scala teams have been hand-rolling for years: stable row identity plus a watermark for "what changed since snapshot N." The CDC use case is the obvious one — you can build an incremental reader that filters on _last_updated_sequence_number > my_last_seen and gets exactly the rows touched since your last run, without an external last_modified column or a separate Delta CDF feed.

// Reading row lineage metadata columns
val changed = spark.sql("""
  SELECT _row_id, _last_updated_sequence_number, event_id, payload
  FROM my_catalog.warehouse.events
  WHERE _last_updated_sequence_number > 12345
""")

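// Same idea, but deriving the next watermark from the lineage column itself.
// (The documented columns of the snapshots metadata table do not include a sequence number.)
import spark.implicits._ // provides the Encoder that .as[Long] needs
val latestSeq = spark.sql("""
  SELECT MAX(_last_updated_sequence_number) FROM my_catalog.warehouse.events
""").as[Long].head()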

A few practical notes:

  • Storage cost is real. v3 adds two long fields to every row. For wide rows that's negligible; for narrow event-store tables it can be a few percent of file size. There's no way to disable it in v3.
  • Migration is gradual. When you upgrade a v2 table to v3, existing rows have _row_id = null until rewritten. New inserts get populated immediately. A compaction pass after the upgrade fills the rest.
  • The next-row-id is a table-level counter. Writers must coordinate through the catalog to allocate IDs. This is fine for normal catalog-mediated writes; it does mean you can't bypass the catalog and dump data files directly into a v3 table.
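
The lineage semantics described in this section (stable IDs, a per-row "last touched" sequence number, and a table-level ID counter) can be sketched with a small in-memory model. Everything here is illustrative: Row, LineageTable, and their methods are hypothetical names that mirror the v3 metadata columns, not Iceberg APIs.

```scala
// Toy in-memory model; field names mirror the v3 metadata columns.
final case class Row(rowId: Long, lastUpdatedSeq: Long, eventId: Long, payload: String)

final case class LineageTable(rows: Vector[Row], nextRowId: Long, seq: Long) {

  // Each commit bumps the sequence number; inserts draw IDs from the counter.
  def insert(events: Seq[(Long, String)]): LineageTable = {
    val newSeq = seq + 1
    val added = events.zipWithIndex.map { case ((id, p), i) =>
      Row(nextRowId + i, newSeq, id, p)
    }
    LineageTable(rows ++ added, nextRowId + events.size, newSeq)
  }

  // An update keeps the row ID and only advances the last-updated sequence number.
  def update(eventId: Long, payload: String): LineageTable = {
    val newSeq = seq + 1
    val updated = rows.map { r =>
      if (r.eventId == eventId) r.copy(lastUpdatedSeq = newSeq, payload = payload) else r
    }
    copy(rows = updated, seq = newSeq)
  }

  // Incremental read: everything touched after the caller's watermark.
  def changedSince(watermark: Long): Vector[Row] =
    rows.filter(_.lastUpdatedSeq > watermark)
}

val v1 = LineageTable(Vector.empty, nextRowId = 0L, seq = 0L)
  .insert(Seq(1L -> "click", 2L -> "view"))
val watermark = v1.seq
val v2 = v1.update(2L, "purchase").insert(Seq(3L -> "click"))

val changes = v2.changedSince(watermark)
println(changes.map(r => (r.rowId, r.eventId, r.payload)))
// Vector((1,2,purchase), (2,3,click))
```

The updated row keeps rowId 1 while its sequence number moves past the watermark, which is what lets a CDC consumer tell an update apart from a brand-new row.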

VARIANT: Shared With Spark and Delta

v3 introduces the VARIANT data type — binary-encoded, queryable, semi-structured data with no upfront schema. The encoding is the same open specification Spark 4.0's native VARIANT uses, which is in turn shared with the Delta Lake project.

If you've already read The VARIANT Data Type in Spark 4.0, the surface area is nearly identical from Scala — you call parse_json and variant_get and the data flows through. The interesting thing about Iceberg v3 is that the storage now matches: you can write VARIANT from Spark into an Iceberg table and the bytes that land on disk are the same binary representation Spark uses in memory.

spark.sql("""
  CREATE TABLE my_catalog.warehouse.events_v3 (
    event_id BIGINT,
    event_time TIMESTAMP,
    payload VARIANT
  ) USING iceberg
  TBLPROPERTIES ('format-version' = '3')
""")

// Insert semi-structured payloads
spark.sql("""
  INSERT INTO my_catalog.warehouse.events_v3 VALUES
    (1, current_timestamp(), parse_json('{"user": "alice", "action": "click"}')),
    (2, current_timestamp(), parse_json('{"user": "bob", "action": "purchase", "amount": 49.99}'))
""")

// Query nested paths — same syntax as native Spark VARIANT
spark.sql("""
  SELECT
    variant_get(payload, '$.user', 'string')   AS user,
    variant_get(payload, '$.amount', 'double') AS amount
  FROM my_catalog.warehouse.events_v3
""").show()

What's not yet broadly implemented is variant shredding — the optimization where commonly-accessed paths are materialized as their own typed sub-columns to skip the binary decode entirely. The Iceberg spec accommodates it; engine support varies. Spark 4.1 adds shredding on the write side, but most production v3 tables today store VARIANT as a single binary blob per row. That's still fast — just not as fast as it will be in another release or two.
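
For teams that prefer the DataFrame API over SQL strings, the parse-and-extract flow from this section can be sketched as follows. It assumes Spark 4.0+ on the classpath and uses the parse_json and variant_get functions from org.apache.spark.sql.functions; it runs against a local session and an in-memory DataFrame, so no Iceberg catalog is involved.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, parse_json, variant_get}

// Assumes Spark 4.0+; a local session is enough for this sketch.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("variant-demo")
  .getOrCreate()
import spark.implicits._

val raw = Seq(
  (1L, """{"user": "alice", "action": "click"}"""),
  (2L, """{"user": "bob", "action": "purchase", "amount": 49.99}""")
).toDF("event_id", "json")

// Parse the JSON text into a VARIANT column.
val parsed = raw.select(col("event_id"), parse_json(col("json")).as("payload"))

// Extract typed values by path; a path that is absent yields NULL.
val extracted = parsed.select(
  col("event_id"),
  variant_get(col("payload"), "$.user", "string").as("user"),
  variant_get(col("payload"), "$.amount", "double").as("amount")
)

extracted.show()
```

Writing parsed into a v3 table would then be an ordinary append (for example via writeTo on a configured catalog), since the column is already a VARIANT.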


Default Column Values

A small feature with an outsized quality-of-life impact: v3 lets you specify a default value when you ALTER TABLE ... ADD COLUMN. Existing data files don't have the new column, but queries against them return the default rather than NULL.

// Add a column with a default — no data rewrite, no backfill
spark.sql("""
  ALTER TABLE my_catalog.warehouse.events
  ADD COLUMN region STRING DEFAULT 'unknown'
""")

// Old rows return 'unknown' for region; new rows can supply their own value
spark.sql("SELECT event_id, region FROM my_catalog.warehouse.events").show()

Under v2, this was either a NULL (and you handled it with COALESCE everywhere) or a backfill (and you paid the rewrite cost on a multi-petabyte table). v3 stores the default in the column metadata. The reader injects it at scan time when the file doesn't have the column.
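
The scan-time injection described above can be illustrated with a toy reader. This is a sketch, not Iceberg's implementation: ColumnMeta, FileRows, and read are hypothetical names, and values are modeled as strings for brevity.

```scala
// Toy model: the default lives in table-level column metadata, while each
// data file only knows the columns it was written with.
final case class ColumnMeta(name: String, default: Option[String])
final case class FileRows(columns: Set[String], rows: Vector[Map[String, String]])

// Reader-side projection: use the stored value when the file has the column,
// otherwise fall back to the column's default (None when no default exists).
def read(schema: Vector[ColumnMeta], file: FileRows): Vector[Vector[Option[String]]] =
  file.rows.map { row =>
    schema.map { c =>
      if (file.columns.contains(c.name)) row.get(c.name) else c.default
    }
  }

val schema = Vector(ColumnMeta("event_id", None), ColumnMeta("region", Some("unknown")))

// Written before the ADD COLUMN: no region column in the file.
val oldFile = FileRows(Set("event_id"), Vector(Map("event_id" -> "1")))
// Written after: region is stored in the file.
val newFile = FileRows(
  Set("event_id", "region"),
  Vector(Map("event_id" -> "2", "region" -> "eu-west-1")))

println(read(schema, oldFile)) // Vector(Vector(Some(1), Some(unknown)))
println(read(schema, newFile)) // Vector(Vector(Some(2), Some(eu-west-1)))
```

The old file is never touched; only the projection changes, which is why the ALTER is a metadata-only operation.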

This is one of the v3 features Databricks lists as not yet supported on managed Iceberg as of May 2026, so verify against your specific runtime if you're on Databricks. On Spark 4.0 + Iceberg 1.10 with a standard catalog, it works.


Geospatial Types

v3 defines two new spatial types — GEOMETRY (planar) and GEOGRAPHY (spheroidal) — alongside the encodings and predicates that engines need to do real spatial work. The Iceberg side is the standardized storage layer; the actual query operators come from engine-side libraries like Apache Sedona on Spark.

If you process location data today by stringifying WKT or hand-rolling lat/lng columns, v3 finally gives you a first-class place to put geometry without having to pick an encoding yourself. Engine support is still partial, and this is the v3 feature most likely to land in your toolchain incrementally over the next two or three Iceberg releases.


Nanosecond Timestamps

A footnote for most teams, load-bearing for a few. v3 adds timestamp_ns and timestamptz_ns types with nanosecond precision. If you're ingesting from systems that emit nanosecond timestamps (high-frequency trading, certain telemetry pipelines, kernel-level event sources), you no longer have to truncate at the storage boundary.

For everyone else, the microsecond TIMESTAMP type is unchanged and still the default. Don't switch unless you have a specific reason — the wider type is heavier to serialize and most downstream tools don't preserve the extra precision anyway.
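
A short java.time sketch of what truncating at the storage boundary costs. The timestamp value is made up for illustration, and the epoch-nanosecond calculation assumes the nanosecond types are backed by a 64-bit count of nanoseconds since the epoch.

```scala
import java.time.Instant
import java.time.temporal.ChronoUnit

// A source timestamp carrying nanosecond precision (illustrative value).
val source = Instant.parse("2026-05-19T10:31:54.123456789Z")

// What a microsecond-precision column can keep: the last three digits are dropped.
val asMicros = source.truncatedTo(ChronoUnit.MICROS)

// Nanoseconds since the epoch fit in a signed 64-bit long until the year 2262.
val epochNanos = Math.addExact(
  Math.multiplyExact(source.getEpochSecond, 1000000000L),
  source.getNano.toLong)

println(asMicros)                          // 2026-05-19T10:31:54.123456Z
println(source.getNano - asMicros.getNano) // 789
```

If events from your sources can arrive less than a microsecond apart, those 789 nanoseconds are the difference between a stable ordering and ties.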


Encryption Building Blocks

v3 lays groundwork for table-level encryption — key wrapping, per-file encryption metadata in the manifests, integration with KMS providers. This is genuinely early; engine support is sparse and most teams handling sensitive data still rely on storage-layer encryption (S3 SSE-KMS, etc.) or a separate format like Parquet Modular Encryption.

Treat this as "watch for the next release" rather than "use this today."


Should You Upgrade?

Here's the practical decision tree for Spark Scala teams:

Upgrade now if:

  • Your stack is Spark 4.0 + Iceberg 1.10+ on AWS (or you control the engine selection).
  • You're hitting the v2 position-delete file proliferation in CDC or MERGE-heavy workloads — deletion vectors are the single biggest operational win in v3.
  • You want stable row IDs for incremental processing and don't want to maintain them yourself.
  • You're already using VARIANT in Spark and want the storage layer to match.

Wait if:

  • You need open-source Trino or Athena to read the same tables (open-source Trino is still v2-only; Starburst Galaxy supports v3).
  • PyIceberg-based services or DuckDB need to write to the same tables; v3 write support is limited there.
  • You depend on default values, geospatial types, or nanosecond timestamps and you're on Databricks: v3 requires Runtime 18.0+, and the supported feature subset still excludes some of these as of May 2026.

The migration itself is cheap. A v2-to-v3 upgrade is a single metadata commit, and the data files don't change. Treat it as one-way, though: Iceberg rejects attempts to lower format-version through table properties, so getting back to v2 means rewriting the data into a new v2 table. The advice: upgrade one non-critical table first, run your read workloads against it for a week, then expand.


The Bottom Line

v3 is the format-version bump Iceberg has needed since v2. Deletion vectors fix the operational pain of row-level deletes. Row lineage gives you cheap CDC without a sidecar. VARIANT and the geospatial types close real type-system gaps. The feature most likely to move your numbers today is deletion vectors — start there.

The format-convergence story is the bigger picture. Iceberg v3's VARIANT and deletion vector formats are the same ones Spark and Delta Lake use natively. Whatever your table format choice, the underlying encodings are increasingly shared. As covered in The State of Spark Scala in 2026, the ecosystem is consolidating rather than fragmenting — and v3 is one of the clearer signs of that consolidation.

For the full spec, see the Apache Iceberg specification, Databricks' overview of v3 motivations, and the v3 release notes from Iceberg 1.10.

Article Details

Created: 2026-05-19

Last Updated: 2026-05-19 10:31:54 PM