
Apache Iceberg vs Delta Lake: Choosing a Table Format

Both Iceberg and Delta Lake give you ACID transactions, time travel, and schema evolution on top of object storage. But they make different architectural trade-offs that matter when you're building Spark Scala pipelines. Here's a practical comparison to help you choose.

What Table Formats Actually Do

If you're writing Spark Scala applications that read and write Parquet on S3 or HDFS, you've already hit the limitations of raw files: no transactions, no schema enforcement, no way to safely update or delete rows without rewriting entire partitions.

Table formats solve this by adding a metadata layer on top of your data files. They track which files belong to a table, manage concurrent writes, and enable operations like MERGE INTO, DELETE, and time travel queries. Both Iceberg and Delta Lake do this — they just do it differently.
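As a concrete sketch of what this unlocks, here is a row-level upsert and delete through Spark SQL. Both formats accept the same MERGE INTO syntax; the table and column names (`orders`, `updates`, `order_id`, `status`) are illustrative, and the snippet assumes a `spark` session with the relevant format configured.

```scala
// Hypothetical upsert: apply a batch of changes to an orders table.
// Both Iceberg and Delta Lake support this syntax via Spark SQL.
spark.sql("""
  MERGE INTO orders AS target
  USING updates AS source
  ON target.order_id = source.order_id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")

// Row-level delete: the format rewrites only the affected files,
// not entire partitions
spark.sql("DELETE FROM orders WHERE status = 'cancelled'")
```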


Metadata Architecture

This is the fundamental design difference, and most of the practical trade-offs flow from it.

Delta Lake uses a transaction log — a sequence of JSON files (plus periodic Parquet checkpoints) stored in a _delta_log/ directory alongside your data. Every operation appends a new entry to this log. To read the current table state, a client replays the log from the last checkpoint forward.

-- Delta Lake stores its transaction log alongside the data
-- s3://warehouse/orders/
--   _delta_log/
--     00000000000000000000.json   -- initial commit
--     00000000000000000001.json   -- insert
--     00000000000000000002.json   -- merge
--     00000000000000000010.checkpoint.parquet  -- periodic checkpoint
--   part-00000-*.parquet          -- data files
--   part-00001-*.parquet

This is simple and self-contained. The entire table state lives in one directory — no external catalog required. But as tables grow to thousands of partitions and millions of files, replaying the log from the last checkpoint can become expensive.
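The log is also what powers Delta's history and time travel. A sketch, assuming a `spark` session with Delta configured and reusing the illustrative `s3://warehouse/orders` path from above:

```scala
// Inspect the commit history encoded in _delta_log
spark.sql("DESCRIBE HISTORY delta.`s3://warehouse/orders`").show()

// Time travel: read the table as of an earlier log entry
val ordersV1 = spark.read.format("delta")
  .option("versionAsOf", 1)
  .load("s3://warehouse/orders")
```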

Iceberg uses a hierarchical metadata tree: a metadata file points to manifest lists, which point to manifest files, which track individual data files. Each snapshot captures a complete picture of the table at a point in time.

-- Iceberg's hierarchical metadata structure
-- metadata/
--   v1.metadata.json              -- table metadata (schema, partition spec, snapshots)
--   snap-001.avro                 -- manifest list (points to manifests)
--   manifest-001.avro             -- manifest file (tracks data files + partition stats)
--   manifest-002.avro
-- data/
--   part-00000-*.parquet          -- data files
--   part-00001-*.parquet

This tree structure enables efficient pruning — Iceberg can skip entire manifest files based on partition statistics without scanning every file entry. For large tables with complex partition layouts, this matters. The trade-off is complexity: Iceberg requires a catalog to resolve table names to the current metadata file location.
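One practical benefit of the tree: Iceberg exposes each layer as a queryable metadata table, so you can inspect snapshots and file-level statistics with plain SQL. A sketch, assuming an Iceberg catalog named `my_catalog` and an illustrative `db.orders` table:

```scala
// Snapshot history — one row per commit
spark.sql("""
  SELECT snapshot_id, committed_at, operation
  FROM my_catalog.db.orders.snapshots
""").show()

// Per-file entries the planner prunes against, including partition values
// and record counts tracked in the manifests
spark.sql("""
  SELECT file_path, partition, record_count
  FROM my_catalog.db.orders.files
""").show()
```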


Partition Evolution: Iceberg's Standout Feature

For day-to-day operations, this is the biggest practical differentiator between the two formats, and the reason many teams choose Iceberg.

Iceberg uses hidden partitioning. You define partition transforms on existing columns — days(event_time), bucket(16, user_id) — and Iceberg handles the rest. Queries don't need to reference partition columns directly; the engine prunes partitions automatically based on predicates.

// Iceberg: create a table with partition transforms
spark.sql("""
  CREATE TABLE events (
    event_id BIGINT,
    event_time TIMESTAMP,
    user_id STRING,
    payload STRING
  ) USING iceberg
  PARTITIONED BY (days(event_time), bucket(16, user_id))
""")

// Queries filter on the source column — Iceberg prunes partitions automatically
spark.sql("SELECT * FROM events WHERE event_time > '2026-01-01'")
// No need for an extra predicate on a synthetic partition column like event_date

More importantly, Iceberg supports partition evolution — changing the partition scheme on a live table without rewriting existing data. New data uses the new spec; old data keeps its original layout. Iceberg's query planner handles both transparently.

// Iceberg: evolve partitioning from daily to hourly — metadata-only, no data rewrite
spark.sql("ALTER TABLE events REPLACE PARTITION FIELD days(event_time) WITH hours(event_time)")

// Old data is still partitioned by day, new data by hour
// Queries spanning both periods work seamlessly

Delta Lake does not support partition evolution. Once you define a partition scheme, changing it requires creating a new table and rewriting all data. This makes the initial partition design decision much more consequential — get it wrong, and you're looking at a costly migration.
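For comparison, here is what that migration looks like on the Delta side: a full read-and-rewrite into a new table, followed by a cutover. This is a sketch; the paths and the `event_hour` partition column are illustrative, and in production you'd also handle in-flight writers and backfill validation.

```scala
// No in-place partition evolution in Delta: rewrite into a new table
// with the new partition scheme
spark.read.format("delta").load("s3://warehouse/events")
  .write.format("delta")
  .partitionBy("event_hour") // new partition column (illustrative)
  .save("s3://warehouse/events_v2")

// Then repoint readers and writers at events_v2 and retire the old table
```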

For teams whose data access patterns evolve over time — which is most teams — this is a real advantage for Iceberg.


Catalog Requirements

Delta Lake is self-contained. The transaction log sits in _delta_log/ next to the data. Any Spark session can read and write a Delta table by path alone — no catalog, no external service, no coordination beyond the filesystem.

// Delta Lake: access by path — no catalog needed
val df = spark.read.format("delta").load("s3://warehouse/orders")

// Or register in Spark's session catalog
spark.sql("CREATE TABLE orders USING delta LOCATION 's3://warehouse/orders'")

Delta Lake 4.0 introduced catalog-managed tables as a preview feature, where the catalog becomes the coordinator of table access and source of truth. This is the direction Delta is heading — especially for teams using Unity Catalog — but it's optional, not required.

Iceberg requires a catalog. The catalog resolves table names to the current metadata file location and handles atomic metadata pointer updates. Without a catalog, Iceberg can't track which metadata file is current.

// Iceberg: requires catalog configuration
// build.sbt
libraryDependencies ++= Seq(
  "org.apache.iceberg" % "iceberg-spark-runtime-4.0_2.13" % "1.10.0"
)

// SparkSession with Hive catalog for Iceberg
val spark = SparkSession.builder()
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.type", "hive")
  .config("spark.sql.catalog.my_catalog.uri", "thrift://metastore:9083")
  .getOrCreate()

// Access tables through the catalog
spark.sql("SELECT * FROM my_catalog.db.orders")

The catalog options include Hive Metastore, AWS Glue, Nessie (for git-like table versioning), and any Iceberg REST catalog implementation. This is operational overhead, but it also means Iceberg tables are inherently catalog-aware — which simplifies multi-engine access.
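Switching catalog backends is mostly a configuration change. A sketch of the same session wired to a REST catalog instead of Hive; the endpoint URL is illustrative, and `type = "rest"` follows Iceberg's SparkCatalog configuration:

```scala
import org.apache.spark.sql.SparkSession

// Same Iceberg catalog name, different backend: REST instead of Hive
val spark = SparkSession.builder()
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.type", "rest")
  .config("spark.sql.catalog.my_catalog.uri", "https://catalog.example.com/api") // illustrative endpoint
  .getOrCreate()
```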

For small teams or simple deployments, Delta's path-based access is easier to get running. For organizations that already have a metastore or catalog in place, Iceberg's catalog requirement is not much of a burden.


Engine Support

This is where organizational context matters most.

Delta Lake is deepest on Spark. It was built for Spark, implemented in Scala, and the most advanced features — Spark Connect integration, coordinated commits, catalog-managed tables — are Spark-first. Beyond Spark, Delta Lake supports Flink, Trino, and Presto through connectors, but the feature set is narrower.

Delta Lake bridges the engine gap with UniForm — a feature that generates Iceberg (and Hudi) metadata alongside the Delta transaction log, against the same Parquet data files. This lets Iceberg-compatible engines read Delta tables without data duplication. UniForm is generally available but read-only from the Iceberg side — all writes go through Delta.
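Enabling UniForm is a table-property change on the Delta side. A sketch using the property names from Delta's UniForm documentation; the `orders` table name is illustrative:

```scala
// Turn on Iceberg metadata generation for an existing Delta table.
// Iceberg-compatible engines can then read it; writes still go through Delta.
spark.sql("""
  ALTER TABLE orders SET TBLPROPERTIES (
    'delta.enableIcebergCompatV2' = 'true',
    'delta.universalFormat.enabledFormats' = 'iceberg'
  )
""")
```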

Iceberg was designed to be engine-agnostic from the start. It has native, well-maintained integrations with Spark, Flink, Trino, Presto, Athena, Snowflake, Dremio, and BigQuery. The Iceberg community includes contributors from multiple vendors, and no single company controls the integration story.

Iceberg 1.10.0 shipped full Spark 4.0 compatibility, resolving earlier gaps that had blocked the Delta Lake 4.0 UniForm integration.

The practical question: if your stack is Spark-only (or Spark plus Databricks), Delta Lake gives you the tightest integration with the least operational overhead. If you need Trino, Athena, Snowflake, or Flink to query the same tables, Iceberg's native multi-engine support is more robust than relying on UniForm's read-only bridge.


Schema Evolution

Both formats handle the basics — adding, dropping, renaming, and reordering columns without rewriting data. In practice, the differences are subtle.

Iceberg tracks columns by unique ID rather than by name or position. This means renaming a column is a metadata-only operation, and reordering columns doesn't break downstream readers. Iceberg also supports type promotion (e.g., int to long) as a metadata change.
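In Spark SQL these are all single DDL statements, each a metadata-only change. A sketch against the illustrative `events` table from earlier; the `retries` column is hypothetical:

```scala
// Rename: safe because readers resolve the column by ID, not name
spark.sql("ALTER TABLE my_catalog.db.events RENAME COLUMN payload TO body")

// Type promotion: int -> long, no data rewrite
spark.sql("ALTER TABLE my_catalog.db.events ALTER COLUMN retries TYPE BIGINT")

// Reorder: move a column first without breaking downstream readers
spark.sql("ALTER TABLE my_catalog.db.events ALTER COLUMN user_id FIRST")
```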

Delta Lake tracks columns by name and supports similar evolution operations. Type widening graduated to GA in Delta Lake 4.0, allowing in-place type changes like INT to LONG or FLOAT to DOUBLE without data rewrites.
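A sketch of type widening on a Delta table; the `delta.enableTypeWidening` property gates the feature, and the `orders`/`quantity` names are illustrative:

```scala
// Opt the table into the type widening feature, then widen in place
spark.sql("ALTER TABLE orders SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')")
spark.sql("ALTER TABLE orders ALTER COLUMN quantity TYPE BIGINT") // INT -> LONG, no rewrite
```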

For most Spark Scala teams, schema evolution works well in both formats. Iceberg's column-ID tracking is more theoretically robust, but Delta's name-based approach rarely causes issues in practice.


File Format Support

A minor but sometimes relevant difference: Iceberg supports Parquet, Avro, and ORC. Delta Lake supports Parquet only.

If your organization has existing ORC data or pipelines that produce Avro, Iceberg can work with them directly. For most Spark Scala teams starting fresh, Parquet is the default choice anyway, so this rarely drives the decision.
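In Iceberg, the data file format is a per-table property. A sketch using the `write.format.default` table property; the table name and columns are illustrative:

```scala
// Create an Iceberg table whose data files are written as ORC
// instead of the default Parquet (Avro is also an option)
spark.sql("""
  CREATE TABLE my_catalog.db.raw_events (id BIGINT, blob STRING)
  USING iceberg
  TBLPROPERTIES ('write.format.default' = 'orc')
""")
```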


Community and Governance

Iceberg is an Apache Software Foundation project with vendor-neutral governance. Contributors include engineers from Apple, Netflix, AWS, Google, Snowflake, and Databricks (following the Tabular acquisition). Over the last 12 months, Iceberg has seen roughly 2,900 contributors and 4,100 pull requests.

Delta Lake is a Linux Foundation project, but Databricks remains the primary contributor and steward. It has roughly 1,300 contributors and 2,000 pull requests over the same period. This doesn't mean Delta is less capable — Databricks invests heavily in it — but the contributor base is narrower.

For teams concerned about vendor lock-in at the format level, Iceberg's broader contributor base provides more insurance against any single company's priorities shifting.


Decision Framework

After working through the architectural differences, here's a practical framework for Spark Scala teams:

Choose Delta Lake if:

  • Your stack is Spark-centric, especially if you're on Databricks
  • You value simplicity — path-based access, no catalog dependency, self-contained tables
  • You're already using Delta Lake in production and the migration cost to Iceberg isn't justified
  • UniForm is sufficient for any cross-engine read access you need
  • You want the tightest possible Spark integration, including features like Delta Connect and Variant support

Choose Iceberg if:

  • Your data platform spans multiple engines — Spark plus Trino, Flink, Athena, or Snowflake
  • Multi-cloud portability matters and you want to avoid coupling to one vendor's ecosystem
  • Your partition schemes are likely to evolve — Iceberg's partition evolution is genuinely valuable
  • You have existing ORC or Avro data you want to manage under a table format
  • Vendor-neutral governance is important to your organization

Either works if:

  • You're running single-engine Spark workloads with stable partition schemes and no cross-engine requirements — both formats are mature, well-tested, and performant for this use case

The Convergence Trend

The boundary between these formats is blurring. Delta Lake's UniForm generates Iceberg metadata. Iceberg's REST catalog spec is becoming a standard that multiple platforms adopt. Databricks acquired Tabular (the Iceberg company). Both formats support the same underlying Parquet data files.

As noted in The State of Spark Scala in 2026, the ecosystem is consolidating rather than fragmenting. The most likely outcome isn't one format "winning" — it's interoperability becoming good enough that the choice matters less over time. But we're not there yet, and today the decision is real.

Pick the format that fits your current architecture and operational reality. If you're on Spark and Databricks, Delta Lake is the natural choice. If you're building a multi-engine data platform, Iceberg gives you more flexibility. Both are solid foundations for Spark Scala data pipelines.

Article Details

Created: 2026-04-09

Last Updated: 2026-04-09 10:08:18 PM