Spark 4.2 Preview: What's Coming Next

The Apache Spark community has published three Spark 4.2.0 preview releases — the most recent on March 12, 2026 — for community testing ahead of the stable release. Here's what the preview docs and JIRA tell us about what's coming, and how to try it today.

What Is a Preview Release?

The official word from the Spark project is explicit: preview releases are "not a stable release in terms of either API or functionality." APIs and features you see in preview3 may change before the final 4.2.0 ships. Do not run these in production.

What they are good for: running your own Spark jobs against the preview build, validating that your application code compiles and runs correctly, and surfacing regressions before the stable release. The Spark team actively wants this feedback — send it via the dev mailing list or file a JIRA.

The preview3 docs are available at spark.apache.org/docs/4.2.0-preview3/ and the preview2 binary is at dist.apache.org/repos/dist/release/spark/spark-4.2.0-preview2.

How to Try the Preview

Swap your Spark dependency version in sbt and run your tests:

// build.sbt — try the preview build
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "4.2.0-preview2",
  "org.apache.spark" %% "spark-sql"  % "4.2.0-preview2"
)

The preview artifacts are published to Apache's dist server, not Maven Central, so you'll also need the resolver:

// build.sbt — add the Apache dist resolver
resolvers += "Apache Releases" at "https://dist.apache.org/repos/dist/release/maven/apache-spark-4.2.0-preview2"

If you find a regression, file it at issues.apache.org/jira with the fixVersion set to 4.2.0 so it gets triaged.
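Once the preview dependency resolves, a small end-to-end job is a quick way to confirm the build actually works before pointing your full test suite at it. The sketch below is a minimal smoke test (the app and object names are illustrative, not from the release notes) that exercises a shuffle and an aggregation in local mode:

```scala
// Minimal smoke test against the preview build. Exercises a shuffle
// (groupBy) and an aggregate so the basic DataFrame execution path
// is covered. Run with spark-shell or package it as a small app.
import org.apache.spark.sql.SparkSession

object PreviewSmokeTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-4.2-preview-smoke")
      .master("local[*]") // local mode is enough for a compile-and-run check
      .getOrCreate()
    import spark.implicits._

    val counts = Seq("a", "b", "a", "c", "b", "a")
      .toDF("key")
      .groupBy("key")
      .count()
      .collect()

    // Six input rows should survive the shuffle and aggregation.
    assert(counts.map(_.getLong(1)).sum == 6, "unexpected row counts")
    println(s"Spark ${spark.version} smoke test passed")
    spark.stop()
  }
}
```

If this passes but your real jobs fail, that difference is exactly the kind of signal worth reporting upstream.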

Shuffle Correctness: Order-Independent Checksums Now On By Default

The most notable Scala-relevant change in the 4.2 migration guide is a correctness improvement: order-independent checksums for shuffle outputs are now enabled by default.

What this addresses: when an indeterminate shuffle stage is retried (due to executor failure, for example), the output data may arrive in a different order across retries. If downstream stages consumed partial results from the first attempt and then need to recompute from the second, there's a window where inconsistent data could produce incorrect results in edge cases.

With checksums enabled, Spark detects this inconsistency. When a mismatch is detected, it rolls back dependent stages and re-executes, or fails the job if rollback isn't possible.

For most jobs this is invisible — retries are uncommon on stable clusters. But if you run long jobs on spot/preemptible instances, where executor failures are routine, this is a meaningful correctness improvement.
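To make the failure mode concrete: round-robin `repartition()` is a classic source of an indeterminate shuffle, because the contents of each output partition depend on the order records arrive, which can differ between a stage's first attempt and its retry. A sketch of the job shape the checksum is designed to protect (the DataFrame itself is illustrative):

```scala
// A pattern with an indeterminate shuffle: round-robin repartition()
// can place rows in different partitions if the stage is retried after
// an executor loss. Downstream aggregates that consumed partial output
// from the first attempt are what the new checksum guards against.
import org.apache.spark.sql.functions.col

val counts = spark.range(0, 1000000L)
  .repartition(200)                    // round-robin: partition contents depend on arrival order
  .withColumn("bucket", col("id") % 10)
  .groupBy("bucket")
  .count()
```

Deterministic shuffles — e.g. `repartition(col("bucket"))`, which hashes on a column — don't have this problem, which is why most jobs never notice the check.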

If you see unexpected job failures after upgrading and suspect the checksum is a false positive, you can disable it temporarily:

// Disable order-independent checksum validation (temporary workaround only)
spark.conf.set("spark.sql.shuffle.orderIndependentChecksum.enabled", "false")

That config exists as an escape hatch. If you need it, file a JIRA.

Performance: Subexpression Elimination in FilterExec

SPARK-56032 adds subexpression elimination to FilterExec. If your filter predicates evaluate the same expression multiple times — common when you have complex conditions that reference the same function or column transformation more than once — Spark will now compute it once and reuse the result rather than evaluating it redundantly.

This is a transparent optimizer improvement. No code change required. Jobs with complex filter predicates involving repeated expressions stand to benefit most.
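For a sense of what "repeated expressions in a predicate" looks like in practice, here is an illustrative filter (the `logs` DataFrame and column names are assumptions, not from the release notes) where the same `regexp_extract` appears twice. Under the new behavior it should be evaluated once per row:

```scala
// A filter that references the same expensive expression twice.
// With subexpression elimination in FilterExec, the regexp_extract
// is computed once per row and reused across both conditions.
import org.apache.spark.sql.functions._

val extracted = regexp_extract(col("raw"), """id=(\d+)""", 1)

val filtered = logs.filter(
  extracted =!= "" && extracted.cast("long") > 1000
)
```

Nothing about the code needs to change to get the benefit; the example only shows the predicate shape that stands to gain.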

Query Optimization: CBO Propagates DistinctCount Through Union

SPARK-56047 improves the Cost-Based Optimizer's cardinality estimates for queries involving UNION. Previously, the CBO could lose track of distinct count statistics across union operations, leading to suboptimal join order and plan choices downstream. With this fix, distinct count estimates propagate correctly through union nodes.

In practice this matters most for complex queries that combine multiple data sources with UNION or UNION ALL before joining against another table. If the CBO was picking a bad join strategy on those queries, this may help.
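A hypothetical query shape that benefits (table and column names are invented for illustration): distinct-count statistics on the unioned key now survive the `UNION ALL`, so the CBO can estimate the join's cardinality instead of guessing. Note that the CBO only helps if it's enabled and column statistics exist:

```sql
-- Assumes CBO is on and stats are collected, e.g.:
--   SET spark.sql.cbo.enabled=true;
--   ANALYZE TABLE events_eu COMPUTE STATISTICS FOR COLUMNS user_id;
--   ANALYZE TABLE events_us COMPUTE STATISTICS FOR COLUMNS user_id;
SELECT u.name, COUNT(*) AS events
FROM (
  SELECT user_id FROM events_eu
  UNION ALL
  SELECT user_id FROM events_us
) e
JOIN users u ON e.user_id = u.id
GROUP BY u.name;
```

If plans on queries like this previously showed an oversized build side or a sort-merge join where a broadcast was expected, it's worth re-checking them against the preview.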

Structured Streaming: LowLatencyMemoryStream Load Balancing

SPARK-56023 improves load balancing in LowLatencyMemoryStream, the in-memory stream used by real-time mode tests and benchmarks. This is primarily relevant for developers testing Spark 4.1's real-time streaming mode — it ensures more even distribution of records across partitions in low-latency microbenchmark scenarios.

Not a production workload change, but worth knowing if you're running benchmarks against real-time mode.

Kubernetes: Tighter Defaults

Two Kubernetes-related defaults change in 4.2:

NetworkPolicy enabled by default. Executor pods now have a NetworkPolicy applied by default, restricting them to only accept ingress traffic from the driver and peer executors within the same job. This is a security hardening change. If you run Spark on Kubernetes and your executors need to receive inbound connections from other services (uncommon, but possible in some sidecar setups), this change may break things. You can restore the previous behavior by excluding the NetworkPolicyFeatureStep from the pod template feature steps.

Executor pod allocation batch size raised from 10 to 20. When Spark requests executor pods from Kubernetes, it batches the requests to avoid overwhelming the API server. The batch size doubles to 20, which means faster initial scale-out on larger clusters. Lower-risk change — just faster provisioning.
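If the doubled batch size turns out to be too bursty for your API server, the allocation batch size is a standard Spark-on-Kubernetes knob and can be set back explicitly. A sketch of the relevant `spark-submit` flags (the master URL and image name are placeholders):

```shell
# Restore the pre-4.2 executor allocation batch size of 10 if your
# Kubernetes API server is sensitive to bursty pod-creation requests.
spark-submit \
  --master k8s://https://my-apiserver:6443 \
  --conf spark.kubernetes.allocation.batch.size=10 \
  --conf spark.kubernetes.container.image=my-registry/spark:4.2.0-preview2 \
  --class com.example.MyJob \
  my-job.jar
```

The NetworkPolicy opt-out is less mechanical — per the migration guide it means excluding the NetworkPolicyFeatureStep from the pod feature steps, and the exact configuration for that will depend on how your deployment builds its pod templates, so check the preview's Kubernetes docs before relying on it.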

PySpark: Arrow On By Default

Several PySpark defaults change in 4.2 that are worth knowing if your Scala jobs run alongside Python code or if you have a mixed Spark deployment:

  • Arrow-based columnar data exchange is now default-on (spark.sql.execution.arrow.pyspark.enabled=true)
  • Arrow-optimized Python UDFs are default-on (spark.sql.execution.pythonUDF.arrow.enabled=true)
  • Arrow-optimized Python UDTFs are default-on (spark.sql.execution.pythonUDTF.arrow.enabled=true)
  • PyArrow minimum version raised to 18.0.0

For pure Scala jobs, these changes are irrelevant. For teams running both Scala and Python jobs on shared clusters, or using PySpark alongside Scala applications, verify that your Python environments have PyArrow >= 18.0.0 before upgrading.
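On a shared cluster where you can't upgrade every Python environment in lockstep, one option is to pin the pre-4.2 Arrow behavior cluster-wide until PyArrow is current everywhere. A sketch of the relevant `spark-defaults.conf` entries, using the config keys listed above:

```properties
# spark-defaults.conf: hold the pre-4.2 Arrow defaults on a shared
# cluster until every Python environment has PyArrow >= 18.0.0.
spark.sql.execution.arrow.pyspark.enabled     false
spark.sql.execution.pythonUDF.arrow.enabled   false
spark.sql.execution.pythonUDTF.arrow.enabled  false
```

Treat this as a transition measure, not a destination — the Arrow paths are the defaults for a reason, and the non-Arrow code paths are the ones likely to see less attention going forward.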

Also: Derby JDBC datasource support is deprecated in 4.2. If anything in your stack uses the Derby JDBC connector with Spark, plan to migrate.

Dependency Updates

Notable library bumps in the JIRA backlog for 4.2:

  • Jackson: 2.21.2
  • Netty-tcnative: 2.0.75.Final
  • Maven (internal build): 3.9.14

The Jackson bump is the one most likely to surface in Spark applications that use Jackson directly (via com.fasterxml.jackson imports). Check for classpath conflicts if you pin a fixed Jackson version in your own build.
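In sbt, the usual fix for a mixed Jackson classpath is to align your pinned version with Spark's via `dependencyOverrides`. A sketch, assuming your build pins the core modules (the jackson-module-scala coordinate and matching version are an assumption — Jackson's Scala module generally tracks the databind version, but verify against what 4.2 actually ships):

```scala
// build.sbt: align a pinned Jackson with Spark 4.2's bump to avoid
// runtime linkage errors from mixing Jackson versions on one classpath.
dependencyOverrides ++= Seq(
  "com.fasterxml.jackson.core"   %  "jackson-databind"     % "2.21.2",
  "com.fasterxml.jackson.core"   %  "jackson-core"         % "2.21.2",
  "com.fasterxml.jackson.module" %% "jackson-module-scala" % "2.21.2"
)
```

`sbt evicted` (or the dependency graph plugin) will show whether anything on the classpath still resolves to an older Jackson after the override.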

What to Watch Before the Stable Release

Given that the 4.2 preview cycle is well underway (three previews across 10 weeks), the stable release is likely coming in Q2 2026 — but that's an inference, not an official date. The Spark project doesn't publish release schedules.

A few things to monitor before committing to an upgrade plan:

  • Shuffle checksum behavior on retried jobs. The new default correctness check is the biggest behavioral change for Scala jobs. Test it on your retry-heavy workloads.
  • Any Structured Streaming changes that land before stable. The preview3 migration guide lists nothing streaming-specific for 4.2 yet, but that could change.
  • Plugin and library compatibility. If you use custom Spark listeners, external data sources, or third-party connectors, verify they compile and run against the preview build.

If you're currently on Spark 4.1 and running smoothly, there's no urgency — 4.2 looks like an incremental release, not a major one. The 4.1 upgrade path is where the bigger features landed.

If you're still on Spark 3.x, the 3.x → 4.x jump is the one that needs planning. See the Spark 3 to 4 migration guide — upgrading to 4.2 when it ships will be the same work as upgrading to 4.1 today.

Watch the official Spark news page for the stable release announcement.

Article Details

Created: 2026-03-23

Last Updated: 2026-03-23 11:31:18 PM