Using provided Scope for Spark Dependencies in sbt
Spark dependencies belong in provided scope. The cluster already has them — bundling them into your fat jar wastes space, causes version conflicts, and can break your application at runtime. Here's how provided works in sbt and what to watch out for.
What provided Scope Does
When you mark a dependency as provided, sbt includes it on the compile and test classpaths but excludes it from the packaged artifact. Your code compiles against the library, your tests can use it, but sbt assembly (or sbt package) won't bundle it into the output jar.
// build.sbt
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-sql" % "3.4.1" % "provided",
)
Compare this to the default Compile scope:
// build.sbt — DON'T do this for cluster deployments
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-sql" % "3.4.1",
)
Without provided, sbt-assembly pulls in Spark and all of its transitive dependencies — Hadoop, Hive, Netty, Guava, Protobuf, and hundreds more. Your fat jar balloons from a few megabytes to over 200MB, and you'll fight merge strategy conflicts for every overlapping file.
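If you do end up assembling overlapping non-provided dependencies, sbt-assembly's merge strategy setting is where those conflicts get resolved. A minimal sketch (sbt-assembly 2.x syntax; the match cases are illustrative, not a recommended production config):

```scala
// build.sbt — illustrative sketch only: with Spark marked provided
// you rarely need much here, but overlapping third-party jars might.
ThisBuild / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard // drop duplicate manifests and signature files
  case "reference.conf"              => MergeStrategy.concat  // Typesafe Config files must be concatenated
  case _                             => MergeStrategy.first   // otherwise keep the first copy found
}
```

Marking Spark as provided makes most of these cases disappear, because the worst offenders (Hadoop, Netty, Guava) never enter the assembly in the first place.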
Why the Cluster Already Has Spark
When you submit an application with spark-submit, the driver and executors run inside a JVM that already has Spark on its classpath. The cluster (whether it's YARN, Kubernetes, or standalone) provides these jars. Your application jar only needs to contain your code and any dependencies that aren't already on the cluster.
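For concreteness, a typical submission looks something like this (the class name, jar path, master, and deploy mode are placeholders for illustration):

```shell
# Hypothetical example: the cluster supplies the Spark jars; the
# application jar contains only your code plus non-provided deps.
spark-submit \
  --class com.example.MyApp \
  --master yarn \
  --deploy-mode cluster \
  target/scala-2.13/myproject-assembly-0.1.0-SNAPSHOT.jar
```

Note that nothing in this command distributes Spark itself: the driver and executor JVMs get Spark from the cluster installation.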
If your fat jar also contains Spark, you end up with two copies of the same classes on the classpath — your bundled version and the cluster's version. This leads to:
- LinkageError or NoSuchMethodError when your bundled Spark version doesn't exactly match the cluster's
- Classloader conflicts where the JVM loads a class from the wrong jar
- Wasted bandwidth and storage uploading and distributing a jar that's 10-50x larger than it needs to be
The provided scope exists precisely for this situation: "I need this to compile, but someone else will supply it at runtime."
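Application code compiles against Spark as usual; only packaging changes. A minimal main for illustration (the object and package names are hypothetical):

```scala
// src/main/scala/com/example/MyApp.scala — hypothetical app
import org.apache.spark.sql.SparkSession

object MyApp {
  def main(args: Array[String]): Unit = {
    // On the cluster, SparkSession comes from the cluster's jars,
    // not from your assembly: that's the "provided" contract.
    val spark = SparkSession.builder().appName("my-app").getOrCreate()
    spark.range(100).selectExpr("sum(id)").show()
    spark.stop()
  }
}
```

This source compiles because spark-sql is on the compile classpath; the resulting assembly contains MyApp but no org.apache.spark classes.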
A Complete build.sbt with provided
// build.sbt
name := "myproject"
scalaVersion := "2.13.11"
val sparkVersion = "3.4.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
"org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
"com.lihaoyi" %% "utest" % "0.8.1" % "test",
)
testFrameworks += new TestFramework("utest.runner.Framework")
Every org.apache.spark dependency should be provided. If you use spark-sql, spark-mllib, spark-streaming, or spark-hive — mark them all. Mixing scopes (some provided, some not) causes exactly the kind of version conflict you're trying to avoid.
Testing with provided Dependencies
A common concern: if provided dependencies aren't in the assembled jar, will they be available during sbt test? Yes. sbt's provided scope still includes the dependency on the test classpath. Your tests compile and run normally.
// build.sbt
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-sql" % "3.4.1" % "provided",
"com.lihaoyi" %% "utest" % "0.8.1" % "test",
)
// Tests still have access to SparkSession, DataFrame, etc.
Test / fork := true
Test / javaOptions ++= Seq("-Xms512m", "-Xmx4g")
The provided scope affects packaging, not compilation or test execution. If your tests create a SparkSession in local mode, everything works exactly as if the dependency were in Compile scope — because it's still on the classpath when sbt runs your tests.
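As an illustration, a local-mode test like the following compiles and runs under sbt test even though spark-sql is provided (the file, object, and test names are hypothetical):

```scala
// src/test/scala/SparkSpec.scala — hypothetical test file
import utest._
import org.apache.spark.sql.SparkSession

object SparkSpec extends TestSuite {
  val tests = Tests {
    test("local SparkSession works with provided spark-sql") {
      val spark = SparkSession.builder()
        .master("local[2]")      // no cluster needed for tests
        .appName("provided-scope-demo")
        .getOrCreate()
      try {
        val df = spark.range(10) // Dataset of java.lang.Long, values 0..9
        assert(df.count() == 10)
      } finally spark.stop()
    }
  }
}
```

The only thing provided changes is that this test's Spark classes never end up in the assembled jar.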
For more on configuring your test JVM for Spark, see setting JVM options to avoid test OOMs.
When NOT to Use provided
There are two situations where you should leave Spark dependencies in Compile scope (the default):
1. Self-contained applications run outside a cluster. If you're building a standalone tool that creates its own SparkSession in local mode and runs with java -jar, there's no cluster to supply the Spark jars. You need them in your fat jar.
2. Local development with sbt run. The sbt run task doesn't include provided dependencies on the classpath. If you run your application directly through sbt during development, you'll get ClassNotFoundException for Spark classes. For development workflows that depend on sbt run, keep Spark in Compile scope and switch to provided when building for the cluster.
In practice, most Spark teams don't use sbt run — they either run tests or submit to a cluster. If you do need sbt run locally, you can add a workaround:
// build.sbt
Compile / run := Defaults.runTask(
Compile / fullClasspath,
Compile / run / mainClass,
Compile / run / runner,
).evaluated
This overrides the run task to include the full classpath (which contains provided dependencies) instead of the default runtime classpath. Tests and assembly remain unaffected.
What About sbt package vs sbt assembly?
Both sbt package and sbt assembly respect provided scope — neither will include provided dependencies in the output jar. The difference is in how they handle your other (non-provided) dependencies:
- sbt package produces a thin jar with only your compiled classes. You manage the classpath yourself when submitting (via --jars or --packages flags to spark-submit).
- sbt assembly produces a fat jar with your classes and all non-provided dependencies bundled together. The cluster needs to supply only the provided jars.
For most teams, sbt assembly with provided Spark dependencies is the simplest workflow: one self-contained jar, minus the jars the cluster already has. If your assembly build hits merge conflicts, see configuring merge strategies for the fix.
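If you go the sbt package route instead, the thin jar's third-party dependencies have to be supplied at submit time. A hedged sketch, assuming your only non-Spark dependency is a hypothetical upickle artifact:

```shell
# Thin jar: your classes only; spark-submit resolves third-party
# deps from Maven coordinates at submit time (groupId:artifactId:version).
spark-submit \
  --class com.example.MyApp \
  --packages com.lihaoyi:upickle_2.13:3.1.0 \
  target/scala-2.13/myproject_2.13-0.1.0-SNAPSHOT.jar
```

This trades a larger upload for a resolution step on every submit, which is why most teams prefer the assembly workflow.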
Verifying What's in Your Jar
After building, check that Spark classes aren't in the output:
# After sbt assembly, inspect the jar contents
jar tf target/scala-2.13/myproject-assembly-0.1.0-SNAPSHOT.jar | grep "org/apache/spark"
# (no output: Spark classes are excluded)
If you see Spark classes in the output, one of your Spark dependencies isn't marked as provided. Check sbt dependencyTree (built into sbt 1.4+; older sbt versions need the sbt-dependency-graph plugin) to find which dependency is pulling in Spark transitively.
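On sbt 1.4 and newer the dependencyTree task ships with sbt itself but still has to be enabled; a one-line addition to project/plugins.sbt does it:

```scala
// project/plugins.sbt — enables the bundled dependency-tree tasks (sbt 1.4+)
addDependencyTreePlugin
```

After reloading, sbt dependencyTree shows the full resolution graph, and any non-provided path down to org.apache.spark is your culprit.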