Upgrading from Spark 3.x to Spark 4.0: A Practical Guide
Spark 4.0 brings real breaking changes that will likely affect your existing Scala pipelines — ANSI mode on by default, Scala 2.12 support dropped, JDK 17 required, and infrastructure changes to shuffle and event logging. This guide walks through each one with before/after context and the config knob to fall back on if you need more time to migrate.
If you want a feature overview first, see What's New in Spark 4.0 for Scala Developers. This guide is the migration companion — focused on what breaks and how to fix it.
Step 1: Check Your Prerequisites
Before anything else, two hard requirements:
Scala 2.13 only. Spark 4.0 drops Scala 2.12 entirely. There's no 2.12 build. If you haven't migrated yet, you need to do that first. The main pain points are JavaConverters → CollectionConverters and collection API changes (.to[List] → .to(List)). For most codebases this is mechanical, but it takes time.
JDK 17 minimum. JDK 8 and JDK 11 are gone. JDK 21 is also supported. Upgrade your build infrastructure, CI, and runtime environments before attempting a Spark upgrade.
// build.sbt — update your Scala version
scalaVersion := "2.13.16"
// If you use parallel collections (e.g. .par), add this dependency:
libraryDependencies += "org.scala-lang.modules" %% "scala-parallel-collections" % "1.2.0"
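The 2.13 collection changes in practice, as a minimal before/after sketch (the values here are illustrative):

```scala
// Scala 2.12 import (gone in 2.13):
//   import scala.collection.JavaConverters._
// Scala 2.13 replacement:
import scala.jdk.CollectionConverters._

val javaList = java.util.Arrays.asList("a", "b", "c")
val scalaList: List[String] = javaList.asScala.toList // asScala is unchanged; only the import moves

// Collection API change: .to[List] becomes .to(List)
val asList: List[Int] = Set(1, 2, 3).to(List)
```

Most occurrences can be fixed with a project-wide search-and-replace on the import, then chasing the remaining compile errors.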
Check your CI configuration, Dockerfiles, and any deployment scripts that pin a JDK version. A wrong JDK or Scala version fails at build time, which is the right place to catch it.
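One way to fail fast on a stale toolchain is to pin the release target in the build itself; a sketch using standard sbt/scalac options:

```scala
// build.sbt — target JDK 17 bytecode so a pre-17 toolchain fails the build
javacOptions ++= Seq("--release", "17")
scalacOptions += "-release:17"
```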
Step 2: Handle ANSI Mode
This is the change most likely to break a working pipeline. In Spark 3.x, spark.sql.ansi.enabled was false. In Spark 4.0, it's true.
With ANSI mode on, operations that used to silently return null or produce garbage results will now throw runtime errors:
// Spark 3.x: Returns null silently
// Spark 4.0: Throws SparkArithmeticException
spark.sql("SELECT CAST('not-a-number' AS INT)")
// Spark 3.x: Integer wraps around silently
// Spark 4.0: Throws SparkArithmeticException
spark.sql("SELECT 2147483647 + 1")
// Spark 3.x: Division by zero returns null
// Spark 4.0: Throws SparkArithmeticException
spark.sql("SELECT 1 / 0")
The safest migration path is to disable ANSI mode first, get the application running on Spark 4.0, then re-enable it and work through the failures:
// In SparkSession config — disable ANSI mode temporarily during migration
val spark = SparkSession.builder()
.config("spark.sql.ansi.enabled", "false")
.getOrCreate()
// Or set it at runtime
spark.conf.set("spark.sql.ansi.enabled", "false")
Once you re-enable ANSI mode, treat every error as a data quality issue that was previously being silently swallowed. Integer overflow returning a garbage value is worse than a thrown exception — the exception tells you where the problem is.
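You don't have to choose between strict mode everywhere and off everywhere: the try_ function variants (try_cast, try_add, try_divide, added over the Spark 3.2/3.3 releases) keep ANSI mode on globally while restoring null-on-error for the specific expressions you know are dirty. A sketch, assuming a running SparkSession:

```scala
// ANSI mode stays enabled; these expressions return null instead of throwing
spark.sql("SELECT try_cast('not-a-number' AS INT)") // null, no SparkArithmeticException
spark.sql("SELECT try_add(2147483647, 1)")          // null instead of an overflow error
spark.sql("SELECT try_divide(1, 0)")                // null instead of a divide-by-zero error
```

This makes the lenient behavior an explicit, per-expression decision rather than a global default.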
Step 3: Audit Your SQL for Other Breaking Changes
Several other SQL behaviors changed defaults or were removed:
CTE precedence
The default for spark.sql.legacy.ctePrecedencePolicy changed from EXCEPTION to CORRECTED. Inner CTE definitions now shadow outer ones. If you had queries that relied on the outer CTE winning, behavior silently changes:
-- In Spark 4.0, the inner 'data' CTE takes precedence over the outer one
WITH data AS (SELECT 1 AS n)
SELECT * FROM (
WITH data AS (SELECT 99 AS n)
SELECT * FROM data -- returns 99, not 1
)
-- To restore old behavior (throws an error on ambiguity):
-- spark.conf.set("spark.sql.legacy.ctePrecedencePolicy", "EXCEPTION")
format_string index base
format_string() and printf() no longer accept zero-based indexes. If you have any %0$s style format strings, update them to %1$s:
// Spark 3.x: Works (zero-based index)
spark.sql("SELECT format_string('%0$s and %1$s', 'foo', 'bar')")
// Spark 4.0: Throws AnalysisException
// Fix: use 1-based indexes
spark.sql("SELECT format_string('%1$s and %2$s', 'foo', 'bar')")
ORC compression default
ORC files now default to zstd compression instead of snappy. If downstream systems read your ORC files and expect snappy, you'll need to set this explicitly:
spark.conf.set("spark.sql.orc.compression.codec", "snappy")
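If only some outputs need snappy, you can also override compression per write instead of session-wide; a sketch with a hypothetical DataFrame and output path:

```scala
// Per-write override: keeps the session default (zstd) for everything else
df.write
  .option("compression", "snappy")
  .orc("/data/out/events_orc") // hypothetical path
```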
Parquet codec name
If you use LZ4 compression for Parquet files, the codec name changed:
// Spark 3.x
spark.conf.set("spark.sql.parquet.compression.codec", "lz4raw")
// Spark 4.0 — use lz4_raw (with underscore)
spark.conf.set("spark.sql.parquet.compression.codec", "lz4_raw")
Map key normalization
create_map, map_from_arrays, and map_from_entries now normalize -0.0 to 0.0 by default. If your maps use negative zero as a distinct key, this changes behavior silently:
// Spark 3.x: -0.0 and 0.0 are distinct keys
// Spark 4.0: -0.0 is normalized to 0.0
// To restore old behavior:
spark.conf.set("spark.sql.legacy.disableMapKeyNormalization", "true")
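To see why this can fail loudly rather than just differently: the default spark.sql.mapKeyDedupPolicy is EXCEPTION, so normalizing -0.0 to 0.0 can turn two formerly distinct keys into a duplicate-key error. A sketch, assuming a running SparkSession:

```scala
// Spark 3.x: a two-entry map with distinct keys -0.0 and 0.0
// Spark 4.0: both keys normalize to 0.0 and collide under the EXCEPTION dedup policy
spark.sql("SELECT map(-0.0D, 'neg', 0.0D, 'pos')")
```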
JDBC type mapping changes
If you read from relational databases via the JDBC connector, type mappings changed for PostgreSQL, MySQL, SQL Server, DB2, and Oracle. The most common surprises:
- PostgreSQL: TIMESTAMP WITH TIME ZONE now reads as TimestampType (was TimestampNTZType)
- MySQL: FLOAT now reads as FloatType (was DoubleType); SMALLINT as ShortType (was IntegerType)
- Oracle: TimestampType now writes as TIMESTAMP WITH LOCAL TIME ZONE (was TIMESTAMP)
// PostgreSQL — restore old timestamp mapping if needed
spark.conf.set("spark.sql.legacy.postgres.datetimeMapping.enabled", "true")
For MySQL and other databases, check your schemas against the SQL migration guide and run your data pipeline tests before treating this as production-ready.
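A quick way to catch mapping drift is to diff printSchema output between Spark versions for each JDBC source; a sketch with hypothetical connection details:

```scala
// Hypothetical connection details — substitute your own
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/appdb")
  .option("dbtable", "public.events")
  .option("user", "reader")
  .option("password", sys.env("DB_PASSWORD"))
  .load()

df.printSchema() // compare against the Spark 3.x output before cutting over
```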
Step 4: Infrastructure Changes
Mesos is removed
If you're running Spark on Mesos, that's done. Spark 4.0 removes Mesos support entirely. Migrate to YARN, Kubernetes, or Spark Standalone before upgrading. There's no compatibility shim or fallback.
Shuffle service now uses RocksDB by default
The external shuffle service database backend changed from LevelDB to RocksDB:
# spark-defaults.conf — to keep LevelDB during migration
spark.shuffle.service.db.backend=LEVELDB
The RocksDB backend is more performant and reliable, but verify your shuffle service configuration before deploying. This only matters if you use the external shuffle service (most YARN and Kubernetes deployments do).
Event log rolling and compression
Two event log defaults changed:
# Both are now true by default in Spark 4.0
# spark.eventLog.rolling.enabled=true
# spark.eventLog.compress=true
# To restore pre-4.0 behavior:
spark.eventLog.rolling.enabled=false
spark.eventLog.compress=false
Rolling event logs are better for long-running applications — they don't grow unbounded. But if your Spark History Server or monitoring tooling reads event logs and hasn't been updated to handle rolled/compressed logs, check compatibility first.
Jakarta servlet migration
Spark's internal servlet API moved from javax to jakarta. This only affects you if you've written custom code that extends Spark's web UI, custom REST endpoints, or any code that touches servlet APIs:
// Spark 3.x
import javax.servlet.http.HttpServletRequest
import javax.servlet.http.HttpServletResponse
// Spark 4.0
import jakarta.servlet.http.HttpServletRequest
import jakarta.servlet.http.HttpServletResponse
Check your transitive dependencies too. Libraries that declare a javax.servlet dependency may need updated versions that declare jakarta.servlet instead. Running sbt dependencyTree and grepping for javax.servlet will surface the issues.
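A sketch of that check (sbt 1.4+ bundles the dependency-tree plugin; enable it with addDependencyTreePlugin in project/plugins.sbt if it isn't already on):

```shell
# Dump the dependency tree, then look for anything still pulling javax.servlet
sbt dependencyTree > deps.txt
grep -i "javax.servlet" deps.txt
```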
Step 5: Dependency and Classpath Changes
Spark 4.0 bumps major dependencies significantly. The ones most likely to cause classpath issues:
| Dependency | Spark 3.x | Spark 4.0 |
|---|---|---|
| Guava | 14.0.1 | 33.4.0-jre |
| Jackson | 2.15.2 | 2.18.2 |
| Arrow | 12.0.1 | 18.1.0 |
| Parquet | 1.13.1 | 1.15.2 |
| Hadoop | 3.3.4 | 3.4.1 |
| Netty | 4.1.96 | 4.1.118 |
The Guava jump is the most dangerous — from 14.x to 33.x. If you shade Guava, update your shading rules. If you depend on Guava APIs that were removed or moved between those versions, you'll get NoSuchMethodError or ClassNotFoundException at runtime.
Jackson: If you shade or explicitly depend on Jackson, update your shading rules to match 2.18.x. Jackson 2.15 to 2.18 has deprecations but no major breaking API changes.
Hadoop: If you run against a specific Hadoop cluster, verify client compatibility. Spark 4.0 ships with Hadoop 3.4.1 as its default client but generally maintains compatibility with older cluster versions through the HADOOP_VERSION build option.
// Check your fat jar for classpath conflicts after upgrading
// sbt-assembly users — verify shading rules still apply (slash syntax, sbt 1.x)
assembly / assemblyShadeRules := Seq(
  ShadeRule.rename("com.google.common.**" -> "shaded.google.common.@1").inAll
)
Migration Checklist
Work through this in order:
- [ ] Scala 2.13 — migrate from 2.12 if needed (JavaConverters → CollectionConverters, collection API changes)
- [ ] JDK 17 — update build, CI, and runtime environments
- [ ] Disable ANSI mode — set spark.sql.ansi.enabled=false initially to unblock the upgrade
- [ ] Run your test suite on Spark 4.0 with ANSI mode off — fix compilation issues first
- [ ] Re-enable ANSI mode — work through failures as real data quality issues
- [ ] Check CTE queries — inner CTEs now shadow outer ones by default
- [ ] Check format_string calls — update zero-based %0$s indexes to %1$s
- [ ] Check ORC output — set snappy explicitly if downstream readers expect it
- [ ] Check Parquet codec names — lz4raw → lz4_raw
- [ ] Check JDBC type mappings — verify schemas for PostgreSQL, MySQL, SQL Server, and Oracle reads
- [ ] Migrate off Mesos — move to YARN, Kubernetes, or Standalone
- [ ] Check shuffle service — verify RocksDB backend works in your environment
- [ ] Check event log tooling — History Server and monitoring must handle rolled/compressed logs
- [ ] Update servlet imports — javax.servlet → jakarta.servlet in custom code
- [ ] Audit fat jar / shading rules — especially for Guava 14 → 33 and Jackson 2.15 → 2.18
The ANSI mode change and the Scala/JDK version requirements are the real gating items. Everything else is configuration and import-level changes. If you're already on Scala 2.13 and JDK 17, the path through is clear — disable ANSI mode, upgrade, re-enable, fix.
For the full authoritative list of changes, see the Spark 4.0 release notes, the SQL migration guide, and the core migration guide.