
What's New in Spark 4.0 for Scala Developers

Spark 4.0 is the biggest release in years — over 5,100 resolved tickets from 390+ contributors. Here's what matters most if you're writing or maintaining Spark Scala applications.

The Headline: ANSI SQL Mode Is Now On by Default

This is the change most likely to break your existing code. In Spark 3.x, spark.sql.ansi.enabled defaulted to false. In Spark 4.0, it's true.

What does that mean in practice? Operations that used to silently return null or wrap around will now throw runtime errors. Integer overflow, invalid casts, division by zero — all of these will fail loudly instead of producing garbage data.

// Spark 3.x: Returns null silently
// Spark 4.0: Throws ArithmeticException
spark.sql("SELECT CAST('not-a-number' AS INT)")

// Spark 3.x: Wraps around silently
// Spark 4.0: Throws ArithmeticException
spark.sql("SELECT 2147483647 + 1")

// To restore old behavior:
spark.conf.set("spark.sql.ansi.enabled", "false")

This is arguably a good change — silent data corruption is worse than a loud failure. But if you have pipelines that depend on the old behavior, you'll need to either fix the queries or set spark.sql.ansi.enabled to false while you migrate.
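A middle ground during migration is the try_* family of SQL functions (try_cast, try_add, try_divide, available since Spark 3.2/3.3): they keep ANSI mode enabled globally but return null for individual failing expressions instead of throwing. A minimal sketch:

// With ANSI mode on, try_* functions return null on failure
// instead of throwing, so you can migrate query by query.
spark.sql("SELECT try_cast('not-a-number' AS INT)")  // null, no exception
spark.sql("SELECT try_add(2147483647, 1)")           // null, no overflow error
spark.sql("SELECT try_divide(10, 0)")                // null, no division error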

JDK 17 Minimum, Scala 2.12 Dropped

Spark 4.0 drops support for JDK 8 and JDK 11. The minimum is now JDK 17, with JDK 21 also supported.

Scala 2.12 is gone too. Scala 2.13 is now the only supported version. If you haven't made the 2.12 to 2.13 migration yet, Spark 4.0 forces your hand. Most of the pain points in that migration are around collection API changes (JavaConverters to CollectionConverters, .to[List] to .to(List), etc.), so it's manageable but not trivial for large codebases.
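The collection changes look roughly like this (a minimal before/after sketch; the variable names are illustrative):

```scala
// Scala 2.13: scala.collection.JavaConverters is deprecated
// in favor of scala.jdk.CollectionConverters
import scala.jdk.CollectionConverters._

val javaList = new java.util.ArrayList[Int]()
javaList.add(1)
javaList.add(2)

// .asScala still works, now via the new import
val scalaBuffer = javaList.asScala

// .to[List] is gone; pass the companion object instead
val asList = scalaBuffer.to(List)

println(asList) // List(1, 2)
```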

VARIANT Data Type

Spark 4.0 adds a native VARIANT type for semi-structured data. If you've been storing JSON as strings and parsing it on every read, this is a meaningful improvement.

// Store semi-structured data natively
spark.sql("""SELECT parse_json('{"name": "spark", "version": 4}') AS data""")

// Query nested fields directly
spark.sql("""
  SELECT data:name, data:version
  FROM my_variant_table
""")

VARIANT stores the data in a binary format that's more efficient than reparsing JSON strings on every query. This is particularly useful for event data, API responses, or any dataset where the schema varies across records.
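From Scala, nested fields can also be extracted with the variant_get SQL function, which takes a JSON path and a target type (the table name here is illustrative):

// Extract typed values from a VARIANT column via a JSON path
spark.sql("""
  SELECT
    variant_get(data, '$.name', 'string') AS name,
    variant_get(data, '$.version', 'int') AS version
  FROM my_variant_table
""")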

SQL Pipe Syntax

Spark 4.0 introduces a pipe operator (|>) for SQL that lets you chain transformations top-to-bottom instead of nesting subqueries:

// Traditional nested SQL
spark.sql("""
  SELECT category, total
  FROM (
    SELECT category, SUM(amount) AS total
    FROM transactions
    GROUP BY category
  )
  WHERE total > 1000
  ORDER BY total DESC
""")

// Pipe syntax — reads top to bottom
spark.sql("""
  FROM transactions
  |> AGGREGATE SUM(amount) AS total GROUP BY category
  |> WHERE total > 1000
  |> ORDER BY total DESC
""")

If you write complex SQL queries, pipe syntax makes them significantly more readable. Each step in the pipeline is clear and the data flow is obvious.

SQL User-Defined Functions

You can now define reusable functions entirely in SQL, without writing Scala or Python:

spark.sql("""
  CREATE FUNCTION discount(price DOUBLE, pct DOUBLE)
  RETURNS DOUBLE
  RETURN price * (1.0 - pct / 100.0)
""")

spark.sql("SELECT discount(99.99, 15) AS sale_price")
// Returns: 84.9915

These are useful for pushing domain logic into SQL where analysts can maintain it. For Scala developers, this means fewer UDF registrations in application code.
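The arithmetic is easy to sanity-check with a plain-Scala mirror of the function (this is standalone Scala, not a Spark API):

```scala
// Plain-Scala mirror of the SQL discount() function above
def discount(price: Double, pct: Double): Double =
  price * (1.0 - pct / 100.0)

// ≈ 84.9915, matching the SQL result above
println(discount(99.99, 15))
```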

Session Variables

Session variables let you parameterize queries without string interpolation:

spark.sql("DECLARE threshold INT = 100")
spark.sql("SET VAR threshold = 500")

spark.sql("""
  SELECT * FROM orders
  WHERE amount > session.threshold
""")

This is cleaner and safer than building SQL strings with interpolated values. The variables are scoped to the session, so there's no cross-query contamination.
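If you're calling SQL from Scala, named parameter markers (available since Spark 3.4) solve the same string-interpolation problem from the application side; a minimal sketch, assuming an orders table exists:

// Named parameters: values are passed separately, never spliced into the string
spark.sql(
  "SELECT * FROM orders WHERE amount > :threshold",
  Map("threshold" -> 500)
)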

String Collation Support

Spark 4.0 adds collation support for string comparisons, so you can do case-insensitive matching at the column level:

spark.sql("""
  CREATE TABLE users (
    name STRING COLLATE UNICODE_CI
  )
""")

// Case-insensitive comparison — no need for lower()/upper()
spark.sql("SELECT * FROM users WHERE name = 'Alice'")
// Matches: 'Alice', 'alice', 'ALICE'

If you've been wrapping every string comparison in lower() for case-insensitive matching, collation support is a cleaner solution.

Spark Connect Java Client: Full API Parity

The Spark Connect Java client now has full API compatibility with the existing Dataset/DataFrame API. This means you can use Spark Connect from Scala with the same API surface you're used to — Dataset.observe(), Dataset.groupingSets(), Dataset.explode(), and DataFrame.mergeInto() all work through Connect now.

The new spark.api.mode configuration lets you choose between Classic and Connect modes when you create the session:

// Choose Spark Connect mode at session creation
val spark = SparkSession.builder()
  .config("spark.api.mode", "connect")
  .getOrCreate()

This matters for teams adopting Spark Connect for client-server architecture. You no longer have to sacrifice API coverage to use it.

Jakarta Servlet Migration

All internal servlet APIs have moved from javax to jakarta. If you have custom code that extends Spark's web UI, custom REST endpoints, or anything touching servlet APIs, you'll need to update your imports:

// Spark 3.x
import javax.servlet.http.HttpServletRequest

// Spark 4.0
import jakarta.servlet.http.HttpServletRequest

This also affects transitive dependencies. Libraries that depend on javax.servlet may need updated versions that support jakarta.servlet.
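In an sbt build, this usually means swapping the servlet artifact itself. A hedged build.sbt sketch (the version numbers are illustrative; match what your Spark 4.0 build pulls in):

// build.sbt — replace the javax artifact with its jakarta successor
libraryDependencies -= "javax.servlet" % "javax.servlet-api" % "4.0.1"
libraryDependencies += "jakarta.servlet" % "jakarta.servlet-api" % "5.0.0"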

Structured Streaming: Arbitrary State API v2

Spark 4.0 introduces a new transformWithState operator with a more flexible state management API. You get MapState, ListState, timer support, and TTL-based expiration:

// New state management with transformWithState
// (sketch — Output is a placeholder type for your result rows)
dataset
  .groupByKey(row => row.getString(0))
  .transformWithState(
    new StatefulProcessor[String, Row, Output] {
      // Implement init() to declare state variables:
      //   MapState and ListState for complex state, with optional TTL
      // Implement handleInputRows() to process input and emit output;
      //   register timers there for delayed processing
    },
    TimeMode.ProcessingTime(),
    OutputMode.Append()
  )

If you've been fighting with flatMapGroupsWithState and its limitations, the v2 API is a significant improvement.

Notable Default Changes

Several defaults changed that could silently affect behavior:

Setting                              Spark 3.x       Spark 4.0
spark.sql.ansi.enabled               false           true
spark.sql.orc.compression.codec      snappy          zstd
spark.shuffle.service.db.backend     LEVELDB         ROCKSDB
spark.eventLog.compress              false           true
spark.eventLog.rolling.enabled       false           true
spark.sql.maxSinglePartitionBytes    Long.MaxValue   128m
spark.speculation.multiplier         1.5             3
spark.speculation.quantile           0.75            0.9

Most of these are sensible improvements, but the ORC compression change could affect downstream systems that expect snappy-compressed files. And the maxSinglePartitionBytes change from unbounded to 128MB could affect partition sizing for large datasets.
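If downstream readers do expect snappy ORC files, the old codec is one runtime config away:

// Restore the Spark 3.x ORC codec for compatibility with downstream readers
spark.conf.set("spark.sql.orc.compression.codec", "snappy")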

Mesos Support Removed

If you're running Spark on Mesos — it's time to migrate. Mesos support is completely removed in Spark 4.0. Your options are Kubernetes, YARN, or Spark Standalone.

Better Error Messages

Spark 4.0 adds SQLSTATE error codes and improved error context. When a DataFrame operation fails, you now get a DataFrameQueryContext that points to the specific operation that caused the error, not just a stack trace deep in Catalyst internals. This is a quality-of-life improvement that makes debugging significantly easier.

Major Library Upgrades

Spark 4.0 bumps nearly every major dependency. The ones most likely to affect Scala developers:

  • Guava: 14.0.1 to 33.4.0 — this is a massive jump that can cause classpath conflicts
  • Jackson: 2.15.2 to 2.18.2 — if you shade Jackson, update your shading rules
  • Arrow: 12.0.1 to 18.1.0
  • Parquet: 1.13.1 to 1.15.2
  • Hadoop: check your Hadoop client compatibility

If you use fat jars or shade dependencies, test thoroughly. The Guava jump alone can cause subtle serialization issues.
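With sbt-assembly, relocating the conflict-prone packages inside your fat jar sidesteps most of these clashes. A sketch, assuming the sbt-assembly plugin is enabled and using illustrative shade prefixes:

// build.sbt — relocate conflict-prone packages inside the fat jar
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**" -> "myapp.shaded.guava.@1").inAll,
  ShadeRule.rename("com.fasterxml.jackson.**" -> "myapp.shaded.jackson.@1").inAll
)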

Should You Upgrade?

If you're starting a new project — absolutely, use Spark 4.0. The JDK 17+ requirement, ANSI mode defaults, and VARIANT type are all positive changes.

If you're maintaining an existing Spark 3.x application, the answer depends on your situation. The ANSI mode change and Scala 2.12 drop are the biggest barriers. If you're already on Scala 2.13 and JDK 17, the upgrade is straightforward. If you're still on Scala 2.12 or JDK 11, you have prerequisite work to do first.

Either way, don't rush it. Set spark.sql.ansi.enabled to false initially, run your test suite, and work through the failures methodically. A detailed step-by-step migration guide is coming soon.

For the full list of changes, see the official Spark 4.0 release notes, the SQL migration guide, and the core migration guide.

Article Details

Created: 2026-03-16

Last Updated: 2026-03-16 12:00:00 PM