Delta Lake UniForm: Write Delta, Read Iceberg
Delta Lake's Universal Format (UniForm) generates Iceberg metadata alongside the Delta transaction log, against the same Parquet files, so Iceberg-native engines can read your Delta tables without conversion or duplication. With Delta 4.0.1 restoring Iceberg compatibility for Spark 4.0, UniForm is once again a usable option for Scala teams that need cross-engine reads, provided you understand the limitations.
-
Apache Polaris: The Open Standard Iceberg Catalog
Apache Polaris graduated to a top-level ASF project in February 2026 and is establishing itself as the default open implementation of the Iceberg REST Catalog spec. For Spark Scala teams, it's the piece that lets Spark, Trino, and Flink work against the same Iceberg tables with one source of truth: no Hive Metastore, no per-engine catalog plumbing, and no vendor lock-in.
-
The New Apache Spark Kubernetes Operator: Getting Started
The official Apache Spark Kubernetes Operator launched as an ASF subproject in May 2025, built from scratch instead of forking the aging Kubeflow operator. A year of rapid releases later, it's at 0.9.0 and is the path the Spark community is steering toward for running Scala jobs on Kubernetes.
-
Spark Real-Time Mode vs Apache Flink: Is Spark Finally a True Streaming Engine?
Spark 4.1's Real-Time Mode delivers single-digit millisecond latency for stateless queries and targets sub-300ms p99 on the Databricks runtime, putting Spark within striking distance of Flink for the first time. For most analytics streaming, CDC, and ML feature workloads, RTM closes the gap. For sub-10ms requirements, complex event processing, and true event-at-a-time semantics, Flink still wins — and probably will for a long time.
-
Spark vs Polars: When to Use What in 2026
Polars is the fastest single-node DataFrame engine in open source right now — Rust-backed, multi-threaded, and on small data measurably quicker than DuckDB. It is also not a Spark replacement. For Spark Scala developers the honest framing is: Polars wins for single-node work under ~10GB, Spark wins for everything distributed or stateful, and the JVM story for Polars is thin. Here's a practical decision guide.
-
Apache Iceberg v3: What's New for Spark Users
Iceberg v3 is the first format-version bump since 2021 and finally lands the features Spark Scala teams have been working around for years: deletion vectors that replace position-delete files, mandatory row lineage for cheap CDC, a VARIANT type shared with Spark and Delta, default column values, geospatial types, and nanosecond timestamps. Here's what each feature actually does, which pieces are production-ready on Spark 4.0 today, and what to watch out for if you flip a table to format-version = 3.
-
Apache Gluten: Supercharging Spark with Native C++ Execution
Apache Gluten graduated to an ASF Top-Level Project in March 2026. It pushes Spark's physical operators down to native C++ engines (Velox or ClickHouse) via Substrait and JNI, keeping the JVM in charge of scheduling while the heavy lifting happens off-heap. Here's the architecture, how to wire it up on a Scala workload, and where the rough edges are.
-
Spark Connect for Scala: Building Thin-Client Applications
Spark Connect decouples the application from the cluster with a gRPC protocol, and as of Spark 4.0 the Scala client has near-complete DataFrame and Dataset API parity with classic mode. Here's the architecture, how to wire it up from sbt, and what still doesn't work.
-
The VARIANT Data Type in Spark 4.0: Semi-Structured Data Without Schema Headaches
Spark 4.0 added a native VARIANT type (SPARK-45827) for storing semi-structured data in a compact binary format you can query directly — no upfront schema, no from_json ceremony on every read. The published benchmarks show roughly 8x faster reads than storing the same payload as a JSON string column, and Spark 4.1 adds shredding to push that further. This article shows how the Scala API works, when to reach for VARIANT, and when you still want a strongly typed StructType.
-
ANSI Mode by Default in Spark 4.0: What Breaks and How to Fix It
Spark 4.0 flipped spark.sql.ansi.enabled from false to true, so invalid casts, arithmetic overflow, divide-by-zero, and bad array indices that used to silently return null (or, for integer overflow, a wrapped-around value) now throw runtime errors. This guide catalogs each failure mode with the exception you'll see and the try_* function that fixes it without falling back to legacy mode.