Apache Polaris: The Open Standard Iceberg Catalog

Apache Polaris graduated to a top-level ASF project in February 2026 and is becoming the default open implementation of the Iceberg REST Catalog spec. For Spark Scala teams, it's the piece that lets Spark, Trino, and Flink work against the same Iceberg tables with one source of truth, without Hive Metastore, per-engine catalog plumbing, or vendor lock-in.

For broader context, see Apache Iceberg v3: What's New for Spark Users and Apache Iceberg vs Delta Lake: Choosing a Table Format.

The Headline

Polaris entered the Apache Incubator in August 2024 — originally donated by Snowflake — and graduated to a top-level project on February 19, 2026. Eighteen months of incubation produced six releases, around 100 contributors from across the industry, and over 2,800 closed pull requests. The current stable release is 1.5.0.

Graduation is the signal that mattered. It means Polaris cleared the ASF bar for vendor neutrality and community governance — the same path Iceberg itself walked. Snowflake, Dremio, AWS, Microsoft, and a long tail of independents are now contributing on equal footing, and Polaris is the catalog that engine vendors are choosing to integrate with by default.

The practical implication for Spark Scala teams: if you're standing up an Iceberg lakehouse in 2026 and you don't already have a strong reason to use AWS Glue, Unity Catalog, or Nessie, Polaris is the default to evaluate first.


What a Catalog Actually Does

Iceberg's design pushes most of the metadata into files on object storage — manifest files, manifest lists, snapshots, partition specs. The catalog's job is narrower than what Hive Metastore tried to be. It needs to do exactly two things well:

  1. Map table names to the location of the current metadata.json file. A query against sales.orders becomes a lookup that returns s3://warehouse/sales/orders/metadata/00042-abc.json. From there, Iceberg reads the metadata and plans the scan itself.
  2. Atomically swap that pointer on commit. When a write produces a new metadata.json, the catalog has to point the table at the new file in a single atomic operation. If two writers race, exactly one swap succeeds and the other must retry against the new metadata. This is what gives Iceberg its serializable isolation.

That's it. Schema evolution, partition evolution, snapshots, time travel — those all live in the metadata files. The catalog is the small piece that arbitrates "which metadata.json is current right now."
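
The two primitives can be illustrated with a toy in-memory model. This is a minimal sketch, not Polaris or Iceberg code: the PointerCatalog object, its method names, and the 00043-def.json file name are all hypothetical, and a real catalog backs the swap with a database transaction rather than a ConcurrentHashMap.

```scala
import java.util.concurrent.ConcurrentHashMap

// Toy in-memory catalog: table name -> location of the current metadata.json.
// Hypothetical names throughout; this is not Polaris or Iceberg code.
object PointerCatalog {
  private val pointers = new ConcurrentHashMap[String, String]()

  // Primitive 1: resolve a table name to its current metadata file.
  def currentMetadata(table: String): Option[String] =
    Option(pointers.get(table))

  // Register a table that does not exist yet; returns false if the name is taken.
  def create(table: String, metadataLocation: String): Boolean =
    pointers.putIfAbsent(table, metadataLocation) == null

  // Primitive 2: atomically swap the pointer, but only if the caller's view
  // of the current metadata is still accurate (compare-and-swap).
  def commit(table: String, expected: String, updated: String): Boolean =
    pointers.replace(table, expected, updated)
}
```

A writer whose commit returns false has lost the race. It re-reads currentMetadata, rebuilds its changes on top of the newer metadata, and tries again.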

Everything else a modern catalog does — access control, credential vending, multi-tenancy, audit — is layered on top of those two primitives.


Why REST Beats Hive Metastore

For most of Iceberg's history, the default catalog was Hive Metastore (HMS). It worked, but the design baggage is real. HMS stores metadata in a relational database and exposes it through a Thrift RPC interface, so every engine needs a dedicated client. The schema was designed for Hive partitions, not Iceberg snapshots, so Iceberg's catalog calls have to round-trip through an abstraction that doesn't quite fit. And HMS doesn't natively understand credential vending, so every engine that reads the table needs its own long-lived storage credentials.

The Iceberg REST Catalog specification fixed this by making the catalog API first-class. It's a plain HTTP+JSON contract — every Iceberg-aware engine speaks it, no special client library needed beyond an HTTP stack. The spec includes credential vending out of the box: the catalog can return scoped, short-lived storage credentials with each table response, so engines never need to hold long-lived S3 keys.
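
To make "plain HTTP+JSON" concrete, here is a rough sketch of the load-table call built with the JDK's own HTTP client types. It assumes a single-level namespace and that the catalog name is used as the REST path prefix. The RestCatalogRequests object and its parameter names are illustrative, and the request is built but never sent.

```scala
import java.net.URI
import java.net.http.HttpRequest

object RestCatalogRequests {
  // Builds (but does not send) a "load table" request in the shape the
  // Iceberg REST spec defines: GET /v1/{prefix}/namespaces/{namespace}/tables/{table}.
  // Assumption: single-level namespace, so no separator encoding is needed.
  def loadTable(baseUri: String, prefix: String, namespace: String,
                table: String, bearerToken: String): HttpRequest =
    HttpRequest
      .newBuilder(URI.create(s"$baseUri/v1/$prefix/namespaces/$namespace/tables/$table"))
      .header("Authorization", s"Bearer $bearerToken")
      // Ask the catalog to return scoped storage credentials with the table.
      .header("X-Iceberg-Access-Delegation", "vended-credentials")
      .GET()
      .build()
}
```

Any engine with an HTTP stack can issue this request, which is why no Thrift-style client library is involved.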

Polaris is the open-source reference implementation of that spec, plus the operational features (RBAC, multi-catalog management, audit) that teams need to actually run it in production. If you're already on HMS and it works, you don't have to migrate yesterday. But every greenfield Iceberg deployment in 2026 should start with REST, and Polaris is the implementation that has the most engines aligned behind it.


The Architecture: Catalogs, Principals, Roles

Polaris adds three administrative concepts on top of the bare REST catalog spec:

  • Catalogs — Named, isolated namespaces. Each catalog has its own storage configuration (S3 bucket, GCS bucket, ADLS container) and its own access boundary. A typical setup is one catalog per environment (dev, staging, prod) or per data domain (finance, analytics, ml_features).
  • Principals — Client identities. A Spark job, a Trino cluster, a CI pipeline — each gets its own principal with a client ID and secret. Principals authenticate via OAuth2.
  • Principal Roles and Catalog Roles — Permissions. Principal roles map principals to catalog roles; catalog roles grant specific privileges (TABLE_READ_DATA, TABLE_WRITE_DATA, NAMESPACE_CREATE, etc.) on namespaces and tables. This is the standard role-as-permission-bundle pattern from any RBAC system.
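
The relationship between the three concepts can be sketched as a small data model. This is an illustration only: the case classes and isAllowed are hypothetical names, the privilege strings are the ones quoted above, and the sketch ignores the privilege inheritance and catalog-level grants a real deployment would have.

```scala
object RbacSketch {
  // Privilege names are the ones quoted above; everything else is illustrative.
  final case class Grant(privilege: String, namespace: String)
  final case class CatalogRole(name: String, grants: Set[Grant])
  final case class PrincipalRole(name: String, catalogRoles: Set[CatalogRole])
  final case class Principal(clientId: String, roles: Set[PrincipalRole])

  // A principal holds a privilege on a namespace if any catalog role reachable
  // through any of its principal roles carries a matching grant.
  def isAllowed(p: Principal, privilege: String, namespace: String): Boolean =
    p.roles.exists { principalRole =>
      principalRole.catalogRoles.exists { catalogRole =>
        catalogRole.grants.contains(Grant(privilege, namespace))
      }
    }
}
```

A read-only analytics principal would carry a catalog role with only Grant("TABLE_READ_DATA", "sales"), so a check for TABLE_WRITE_DATA on the same namespace returns false.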

The piece that makes this work in practice is credential vending. When a Spark job authenticates to Polaris and requests access to a table, Polaris uses its own privileged storage credentials to mint a scoped, short-lived token (an STS session for AWS, a downscoped access token for GCS) that grants only the access that role allows on the underlying object storage. The Spark job never holds the warehouse's long-lived credentials; it holds a Polaris OAuth2 token, and Polaris hands it the storage token it needs for each operation.

This is the security property that's hard to retrofit onto HMS. With REST + credential vending, the catalog becomes the single point of policy enforcement for the whole lakehouse.


Connecting Spark to Polaris

The Spark side is plain Iceberg REST catalog configuration — Polaris doesn't need a special connector because it implements the spec faithfully. For Spark 4.0 with Iceberg 1.10:

// build.sbt — Iceberg runtime + AWS bundle for S3 credential vending
libraryDependencies ++= Seq(
  "org.apache.iceberg" % "iceberg-spark-runtime-4.0_2.13" % "1.10.0",
  "org.apache.iceberg" % "iceberg-aws-bundle"             % "1.10.0"
)

Then configure the SparkSession to point at the Polaris REST endpoint:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("polaris-demo")
  .config("spark.sql.extensions",
          "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
  // Register a catalog named "warehouse" backed by Polaris
  .config("spark.sql.catalog.warehouse",
          "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.warehouse.catalog-impl",
          "org.apache.iceberg.rest.RESTCatalog")
  .config("spark.sql.catalog.warehouse.uri",
          "https://polaris.example.com/api/catalog")
  .config("spark.sql.catalog.warehouse.warehouse",
          "analytics_catalog")
  // OAuth2 — the credential is the principal's client_id:client_secret
  .config("spark.sql.catalog.warehouse.credential",
          sys.env("POLARIS_CLIENT_ID") + ":" + sys.env("POLARIS_CLIENT_SECRET"))
  .config("spark.sql.catalog.warehouse.scope", "PRINCIPAL_ROLE:ALL")
  .config("spark.sql.catalog.warehouse.token-refresh-enabled", "true")
  // Tell Polaris to vend short-lived S3 credentials with each table response
  .config("spark.sql.catalog.warehouse.header.X-Iceberg-Access-Delegation",
          "vended-credentials")
  .getOrCreate()

The two pieces that distinguish this from a generic REST catalog config are the OAuth2 credential / scope settings and the X-Iceberg-Access-Delegation: vended-credentials header. The header is what tells Polaris to mint a scoped storage token and return it alongside the table metadata, so the Spark job doesn't need any S3 credentials of its own.
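
For intuition about what the credential and scope settings turn into on the wire, the sketch below renders the form body of an OAuth2 client-credentials token request. It is an approximation under assumptions: the TokenRequest object is hypothetical, and the exact token endpoint and fields your deployment expects may differ, particularly when an external identity provider issues the tokens.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object TokenRequest {
  private def enc(s: String): String = URLEncoder.encode(s, StandardCharsets.UTF_8)

  // Splits "client_id:client_secret" and renders the form-encoded body of an
  // OAuth2 client-credentials request carrying the requested scope.
  def formBody(credential: String, scope: String): String =
    credential.split(":", 2) match {
      case Array(clientId, clientSecret) =>
        Seq(
          "grant_type"    -> "client_credentials",
          "client_id"     -> clientId,
          "client_secret" -> clientSecret,
          "scope"         -> scope
        ).map { case (k, v) => s"$k=${enc(v)}" }.mkString("&")
      case _ =>
        throw new IllegalArgumentException("expected client_id:client_secret")
    }
}
```

The Iceberg client performs this exchange for you and, with token-refresh-enabled set, repeats it before the token expires. The sketch only shows why the credential is written as a single colon-separated string.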

Once the catalog is registered, the table reference uses three-part naming: <catalog>.<namespace>.<table>.

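// The namespace must exist before a table can be created in it
spark.sql("CREATE NAMESPACE IF NOT EXISTS warehouse.sales")

spark.sql("""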
  CREATE TABLE warehouse.sales.orders (
    order_id   BIGINT,
    customer   STRING,
    amount     DECIMAL(10,2),
    order_date DATE
  )
  USING iceberg
  PARTITIONED BY (days(order_date))
""")

spark.sql("""
  INSERT INTO warehouse.sales.orders VALUES
    (1, 'alice', 49.99, DATE'2026-05-29'),
    (2, 'bob',   19.95, DATE'2026-05-29')
""")

spark.sql("SELECT * FROM warehouse.sales.orders").show()

No HMS. No long-lived S3 keys in spark-defaults.conf. The credential boundary is the Polaris principal.
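
One way to keep that property honest is a small guard that scans the effective Spark configuration for static storage keys. This is a sketch under assumptions: StaticKeyCheck is a hypothetical helper, and the suffix list covers only the common Hadoop S3A and Iceberg S3FileIO key properties, so it is not exhaustive.

```scala
object StaticKeyCheck {
  // Property suffixes that commonly carry long-lived S3 keys. The list is an
  // assumption for illustration and is not exhaustive.
  private val staticKeySuffixes = Seq(
    "fs.s3a.access.key",
    "fs.s3a.secret.key",
    "s3.access-key-id",
    "s3.secret-access-key"
  )

  // Returns the config keys that look like static storage credentials, sorted.
  def findStaticKeys(conf: Map[String, String]): Seq[String] =
    conf.keys
      .filter(key => staticKeySuffixes.exists(suffix => key.endsWith(suffix)))
      .toSeq
      .sorted
}
```

In a running job this could be called as StaticKeyCheck.findStaticKeys(spark.conf.getAll) at startup, failing fast if the result is non-empty.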


The Multi-Engine Story

The reason to standardize on REST + Polaris is what happens when a second engine shows up. Configure Trino to point at the same Polaris endpoint with its own principal:

# Trino catalog config: etc/catalog/warehouse.properties
connector.name=iceberg
iceberg.catalog.type=rest
iceberg.rest-catalog.uri=https://polaris.example.com/api/catalog
iceberg.rest-catalog.warehouse=analytics_catalog
iceberg.rest-catalog.security=OAUTH2
# Trino resolves environment variables in config files with the ${ENV:NAME} syntax
iceberg.rest-catalog.oauth2.credential=${ENV:TRINO_CLIENT_ID}:${ENV:TRINO_CLIENT_SECRET}
iceberg.rest-catalog.oauth2.scope=PRINCIPAL_ROLE:ALL
iceberg.rest-catalog.vended-credentials-enabled=true

Trino now sees warehouse.sales.orders as the same table Spark just wrote. No catalog sync, no metadata replication, no "Trino-side schema definition." Both engines hit the same metadata.json pointer through the same REST API. Add Flink, add DuckDB's Iceberg extension, add a Python service via PyIceberg — they all see the same source of truth.

The role boundary holds across engines too. If the Trino principal only has TABLE_READ_DATA on the sales namespace, no amount of INSERT INTO from Trino will succeed — the credential vending step won't return write-capable storage tokens, and Polaris will reject the metadata commit anyway. The access policy lives in one place, not in N per-engine ACL systems.

This is the property that makes the Polaris bet work. Iceberg is the open table format, REST is the open catalog API, and Polaris is the open implementation tying them together. There's no engine-specific lock-in at any layer.


Migrating from Hive Metastore

If you have an existing HMS-backed Iceberg deployment, the migration is metadata-only — your data files don't move. The Iceberg register_table procedure points a new catalog at the existing metadata.json location:

// Run against the new Polaris-backed catalog
spark.sql("""
  CALL warehouse.system.register_table(
    table => 'sales.orders',
    metadata_file => 's3://warehouse/sales/orders/metadata/00042-abc.json'
  )
""")

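This is a single catalog-side commit. The data files, the snapshot history, and the existing partitioning are all preserved. Stop writes through HMS before you register a table, because any commit made through the old catalog afterwards will not be visible in Polaris. Repeat once per table (a script driven by SHOW TABLES against HMS, plus a lookup of each table's current metadata.json location, handles this cleanly), update your Spark configs to point at Polaris, and decommission HMS once you've verified the cutover.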
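
For a handful of tables, the per-table loop can be as simple as generating one register_table call per entry in an inventory. The sketch below assumes you have already collected each table's current metadata.json location. RegisterStatements and its method names are hypothetical, and the function only builds SQL strings, so it can be reviewed before anything runs.

```scala
object RegisterStatements {
  // Renders one register_table call for a table identifier and the location
  // of its current metadata.json file.
  def registerCall(catalog: String, table: String, metadataFile: String): String =
    s"CALL $catalog.system.register_table(table => '$table', metadata_file => '$metadataFile')"

  // Renders the calls for a whole inventory of (table, metadata file) pairs,
  // preserving the input order.
  def registerAll(catalog: String, tables: Seq[(String, String)]): Seq[String] =
    tables.map { case (table, metadataFile) => registerCall(catalog, table, metadataFile) }
}
```

Once the generated statements look right, they can be executed against the Polaris-backed catalog with RegisterStatements.registerAll("warehouse", inventory).foreach(stmt => spark.sql(stmt)), where inventory is your own list of pairs.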

The Iceberg Catalog Migrator tool automates the table-by-table register loop and is the right starting point for any migration with more than a handful of tables.


Where Polaris Fits vs Unity Catalog and Nessie

Polaris isn't the only open catalog in town. The honest framing:

  • Unity Catalog — Originally Databricks-internal, now Linux Foundation, with a strong governance and lineage story and tight Delta Lake integration. The open-source surface is real but still maturing; the most polished experience is on Databricks itself. Choose Unity if you're a Databricks shop or if data governance (column-level policies, lineage, audit) is the leading requirement and you want a single tool for Iceberg + Delta + non-table assets.
  • Nessie — Git-style branching and merging for catalog state. Choose Nessie if branching is a first-class workflow you need — multi-table atomic commits, dev branches off prod, ML experimentation isolation.
  • AWS Glue — Managed, cheap, default on AWS, and now supports the Iceberg REST API. Choose Glue if you're all-in on AWS and want one less service to run.
  • Polaris — The pure REST-spec implementation with the broadest engine alignment. Choose Polaris if multi-engine interoperability and vendor neutrality are the primary requirements, you want self-hosted (or any-cloud) deployment, and you don't need Git-style branching or a heavy governance layer beyond RBAC + credential vending.

These overlap considerably. The trend across all four is toward the REST spec — Glue added REST support, Unity Catalog can expose Iceberg tables via REST, Nessie supports it natively. A few years out, "which catalog" may matter less than "which spec." Polaris is the bet that the REST spec wins and the cleanest reference implementation is the right place to land.


Operational Notes

A few things that aren't in the quickstarts but matter once you put Polaris in production:

  • Persistence backend. The default in-memory metastore is for development only. For production, configure the relational JDBC metastore against a managed Postgres (the older EclipseLink backend is deprecated in the 1.x line), and back it up like any other stateful service.
  • High availability. Polaris is a stateless Java service in front of a database. Run multiple replicas behind a load balancer, scale them with traffic, and treat the database as the durable layer.
  • OAuth2 backend. Polaris ships an internal OAuth2 implementation, but most production deployments front it with a real identity provider (Keycloak, Okta, AWS IAM Identity Center) and let Polaris validate JWTs.
  • Kubernetes. Polaris ships official Helm charts and runs cleanly on Kubernetes. The deployment pattern is identical to any other stateless JVM service — pair it with the Apache Spark Kubernetes Operator and you have the whole lakehouse stack on the same cluster.
  • Audit logging. Every catalog operation produces an audit event. Pipe these to your standard log aggregation (CloudWatch, Loki, Datadog) — this is the data you'll want when someone asks "who dropped this table."

Should You Adopt Polaris?

Yes, if you're starting fresh on Iceberg. The REST spec is the future, Polaris is the cleanest open implementation, and the multi-engine story is real. Build on it now and you're aligned with where the ecosystem is moving.

Yes, if you're on HMS and feeling the operational pain. Credential management, multi-engine plumbing, and the impedance mismatch between Hive's data model and Iceberg's all go away with a REST catalog. Migration is metadata-only.

Probably wait, if you're deep in Databricks. Unity Catalog is the path of least resistance there, and it's increasingly interoperable with Iceberg REST clients anyway. Polaris would be a parallel system without a strong payoff in that environment.

Not yet, if your hard requirement is Git-style branching. Nessie does that better, and that capability isn't on the Polaris roadmap in the same way.


Quick Checklist

To stand up a Polaris-backed Iceberg lakehouse with Spark:

  • Deploy Polaris with a real Postgres backend and an external IdP
  • Create a catalog per environment, configured with the warehouse S3/GCS/ADLS path
  • Create principals per workload (Spark job, Trino cluster, CI) with scoped roles
  • Configure your Spark application with the REST catalog settings, including X-Iceberg-Access-Delegation: vended-credentials
  • Verify reads and writes work without long-lived storage credentials in the Spark config
  • Add a second engine against the same catalog to confirm interoperability
  • Pipe audit logs into your standard log aggregation

For deeper background, see the Apache Polaris site, the Iceberg REST Catalog specification, and the related coverage of Iceberg v3 and the Iceberg vs Delta Lake comparison.

Article Details

Created: 2026-05-29

Last Updated: 2026-05-29 10:06:10 PM