Configuring sbt Assembly Merge Strategies for Spark
Building a fat jar with sbt-assembly for a Spark project almost always hits duplicate file errors. Spark pulls in hundreds of transitive dependencies, and many of them bundle overlapping META-INF files, service descriptors, and even classes. Merge strategies tell sbt-assembly how to resolve these conflicts.
The Problem
You add the sbt-assembly plugin, run sbt assembly, and get something like:
// sbt assembly output
[error] deduplicate: different file contents found in the following:
[error] .../spark-core_2.13-3.4.1.jar:META-INF/MANIFEST.MF
[error] .../hadoop-client-api-3.3.4.jar:META-INF/MANIFEST.MF
[error] .../commons-lang3-3.12.0.jar:META-INF/MANIFEST.MF
[error] deduplicate: different file contents found in the following:
[error] .../spark-core_2.13-3.4.1.jar:META-INF/NOTICE
[error] .../hadoop-client-api-3.3.4.jar:META-INF/NOTICE
This happens because multiple jars contain files at the same path — different MANIFEST.MF files, different NOTICE files, different LICENSE files. sbt-assembly doesn't know which one to keep, so it fails.
The Basics: assembly / assemblyMergeStrategy
The assemblyMergeStrategy setting is a function from file path (String) to merge strategy (MergeStrategy). When sbt-assembly finds duplicate files at the same path, it calls this function to decide what to do.
The built-in strategies are:
| Strategy | Behavior |
|---|---|
| `MergeStrategy.first` | Keep the first file found, discard the rest |
| `MergeStrategy.last` | Keep the last file found, discard the rest |
| `MergeStrategy.concat` | Concatenate all files together |
| `MergeStrategy.filterDistinctLines` | Concatenate but remove duplicate lines |
| `MergeStrategy.discard` | Throw away all copies |
| `MergeStrategy.deduplicate` | Keep one copy if all are identical, fail if they differ (the default) |
| `MergeStrategy.singleOrError` | Fail if more than one file exists |
| `MergeStrategy.rename` | Rename duplicate files to avoid conflicts |
You configure it in build.sbt:
// build.sbt
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}
The case expressions pattern-match on the file path inside the jar. PathList splits the path into segments, so PathList("META-INF", xs @ _*) matches anything under META-INF/. The fallback calls the default strategy for everything else.
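The matching itself is ordinary Scala pattern matching. As a rough illustration, here is a standalone stand-in for the `PathList` extractor (the real one ships with sbt-assembly; this simplified version just splits the path on `/`):

```scala
// Simplified stand-in for sbt-assembly's PathList extractor, for
// illustration only: it splits a jar entry path into segments so a
// pattern can match on the leading directories.
object SegmentList {
  def unapplySeq(path: String): Option[Seq[String]] = Some(path.split("/").toSeq)
}

// Mirrors the build.sbt rules above, returning the strategy name as a string.
def strategyFor(path: String): String = path match {
  case SegmentList("META-INF", _*) => "discard"
  case _                           => "deduplicate"
}
```

So `strategyFor("META-INF/MANIFEST.MF")` hits the first case, while `strategyFor("com/example/Main.class")` falls through to the default.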
A Practical Merge Strategy for Spark
The blanket "discard all META-INF" approach works for many projects, but it's too aggressive for some. Service provider files under META-INF/services/ need to be concatenated — they register implementations that libraries discover at runtime. Discarding them silently breaks things like custom filesystem implementations or serialization providers.
Here's a more precise configuration:
// build.sbt
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  case PathList("META-INF", _*)             => MergeStrategy.discard
  case "reference.conf"                     => MergeStrategy.concat
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}
What each rule does:
- `META-INF/services/*`: concatenated. These files list service provider implementations. Multiple jars can each contribute implementations of the same interface, and the JVM's `ServiceLoader` reads all of them. Discarding these can cause a `ClassNotFoundException` at runtime when a library tries to load a provider that was silently dropped.
- `META-INF/*`: discarded. `MANIFEST.MF`, `LICENSE`, `NOTICE`, `DEPENDENCIES`, `.SF`, `.DSA`, `.RSA`: none of these affect runtime behavior. The order matters: the `services` rule must come first, since pattern matching evaluates top to bottom.
- `reference.conf`: concatenated. Libraries that use Typesafe Config (like Akka, which Spark uses internally) define defaults in `reference.conf`. Each library's defaults need to be merged into a single file so `ConfigFactory.load()` picks them all up.
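To make the concat behavior concrete, here is a sketch of what concatenating two service descriptor files produces. The file name and the first provider are real Hadoop names; the second provider is a hypothetical example:

```scala
// Sketch: two jars each ship META-INF/services/org.apache.hadoop.fs.FileSystem.
// MergeStrategy.concat joins their contents, so ServiceLoader sees every
// provider. The second provider name below is illustrative, not from a real jar.
val fromHadoopJar = "org.apache.hadoop.fs.LocalFileSystem\n"
val fromOtherJar  = "com.example.CustomFileSystem\n" // hypothetical provider
val merged        = fromHadoopJar + fromOtherJar

// Every non-empty line in the merged file is one loadable provider.
val providers = merged.linesIterator.filter(_.nonEmpty).toList
```

Had the `META-INF/*` discard rule matched first, the merged file would never be written and neither provider would be discoverable.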
Handling Class File Duplicates
Sometimes the conflict isn't in META-INF but in actual .class files. This happens when two dependencies bundle the same class — often shaded or relocated copies of common libraries like Guava or commons-io.
// sbt assembly output
[error] deduplicate: different file contents found in the following:
[error] .../guava-14.0.1.jar:com/google/common/base/Strings.class
[error] .../guava-27.0-jre.jar:com/google/common/base/Strings.class
This is a different problem. Two versions of Guava ended up on the classpath, and they contain different bytecode for the same class. Using MergeStrategy.first here silences the error but might cause runtime failures if the wrong version wins.
The better fix is to resolve the version conflict at the dependency level:
// build.sbt
dependencyOverrides += "com.google.guava" % "guava" % "27.0-jre"
dependencyOverrides forces sbt to use a single version of the library across the entire dependency tree. This eliminates the duplicate class before sbt-assembly even sees it.
Use MergeStrategy.first for class conflicts only as a last resort — when you've confirmed the classes are functionally identical or when you can't resolve the version conflict upstream.
Spark with provided Scope
If your Spark dependencies are marked as provided (which they should be for cluster deployments), most of these conflicts go away. The Spark jars and their transitive dependencies aren't included in the fat jar at all — they're supplied by the cluster at runtime.
// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.4.1" % "provided",
)
With provided scope, your fat jar contains only your code and your non-Spark dependencies. This dramatically reduces both jar size and merge conflicts. The merge strategy still matters for conflicts among your remaining dependencies, but you're dealing with far fewer of them.
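One caveat: with provided scope, Spark is no longer on the runtime classpath, so `sbt run` fails locally. A common workaround (a sketch of the pattern from the sbt-assembly documentation; adjust to your build) is to point the run task at the compile classpath, which does include provided dependencies:

```scala
// build.sbt — let `sbt run` see provided-scope dependencies locally.
// The fat jar is unaffected; this only changes the run task's classpath.
Compile / run := Defaults.runTask(
  Compile / fullClasspath,   // includes provided-scope jars
  Compile / run / mainClass,
  Compile / run / runner
).evaluated
```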
Putting It Together
A complete build.sbt for a Spark project that builds a fat jar:
// build.sbt
name := "myproject"
scalaVersion := "2.13.11"
val sparkVersion = "3.4.1"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "com.lihaoyi" %% "utest" % "0.8.1" % "test",
)
testFrameworks += new TestFramework("utest.runner.Framework")
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  case PathList("META-INF", _*)             => MergeStrategy.discard
  case "reference.conf"                     => MergeStrategy.concat
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}
This handles the vast majority of Spark assembly conflicts. If you hit a new conflict, the error message tells you exactly which files collide — add a targeted rule for that path rather than broadening an existing one.
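As an example of a targeted rule, one conflict that often appears with Java 9+ libraries (this particular case is not from the error logs above, just a common one) is `module-info.class`. Module descriptors are meaningless inside a fat jar, so a narrow discard rule is safe:

```scala
// build.sbt — sketch: one targeted rule added ahead of the existing ones.
// module-info.class is a Java module descriptor; discarding every copy is
// safe because fat jars don't participate in the module system.
assembly / assemblyMergeStrategy := {
  case "module-info.class"                  => MergeStrategy.discard
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  case PathList("META-INF", _*)             => MergeStrategy.discard
  case "reference.conf"                     => MergeStrategy.concat
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}
```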
Debugging Merge Conflicts
When you hit a new conflict and aren't sure which strategy to use, check what the conflicting files actually contain:
# Run from your project root
jar tf path/to/dependency.jar | grep "conflicting/path"
If the files are identical, deduplicate (the default) would have passed — so the files differ. Look at the content to decide:
- Text files with additive content (service registrations, config) → `concat`
- Metadata you don't need (manifests, licenses, signatures) → `discard`
- Class files → fix the version conflict with `dependencyOverrides` first, fall back to `first` if needed
Avoid using first or last as a blanket strategy. They silence errors but can hide real problems — like shipping a jar with the wrong version of a transitive class that causes a NoSuchMethodError in production.