
Configuring sbt Assembly Merge Strategies for Spark

Building a fat jar with sbt-assembly for a Spark project almost always hits duplicate file errors. Spark pulls in hundreds of transitive dependencies, and many of them bundle overlapping META-INF files, service descriptors, and even classes. Merge strategies tell sbt-assembly how to resolve these conflicts.

The Problem

You add the sbt-assembly plugin, run sbt assembly, and get something like:

// sbt assembly output
[error] deduplicate: different file contents found in the following:
[error]   .../spark-core_2.13-3.4.1.jar:META-INF/MANIFEST.MF
[error]   .../hadoop-client-api-3.3.4.jar:META-INF/MANIFEST.MF
[error]   .../commons-lang3-3.12.0.jar:META-INF/MANIFEST.MF
[error] deduplicate: different file contents found in the following:
[error]   .../spark-core_2.13-3.4.1.jar:META-INF/NOTICE
[error]   .../hadoop-client-api-3.3.4.jar:META-INF/NOTICE

This happens because multiple jars contain files at the same path — different MANIFEST.MF files, different NOTICE files, different LICENSE files. sbt-assembly doesn't know which one to keep, so it fails.

The Basics: assembly / assemblyMergeStrategy

The assemblyMergeStrategy setting is a function from file path (String) to merge strategy (MergeStrategy). When sbt-assembly finds duplicate files at the same path, it calls this function to decide what to do.

The built-in strategies are:

  • MergeStrategy.first: keep the first file found, discard the rest
  • MergeStrategy.last: keep the last file found, discard the rest
  • MergeStrategy.concat: concatenate all copies into one file
  • MergeStrategy.filterDistinctLines: concatenate, removing duplicate lines
  • MergeStrategy.discard: throw away all copies
  • MergeStrategy.deduplicate: keep one copy if all are byte-identical, fail if they differ (the default)
  • MergeStrategy.singleOrError: fail if more than one file exists at the path
  • MergeStrategy.rename: rename the conflicting files to avoid the collision
You configure it in build.sbt:

// build.sbt
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}

The case expressions pattern-match on the file path inside the jar. PathList splits the path into segments, so PathList("META-INF", xs @ _*) matches anything under META-INF/. The fallback calls the default strategy for everything else.
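To build intuition for how these patterns behave, here is a standalone sketch of PathList-style matching. The PathList object below is a simplified stand-in for sbt-assembly's extractor, not the real API; it just splits a path into segments so the same case patterns apply:

```scala
// Simplified stand-in for sbt-assembly's PathList extractor (not the real API):
// it splits a jar entry path into segments for pattern matching.
object PathList {
  def unapplySeq(path: String): Option[Seq[String]] = Some(path.split("/").toSeq)
}

def strategyFor(path: String): String = path match {
  case PathList("META-INF", "services", _*) => "concat"
  case PathList("META-INF", _*)             => "discard"
  case "reference.conf"                     => "concat"
  case _                                    => "deduplicate"
}

println(strategyFor("META-INF/MANIFEST.MF"))              // discard
println(strategyFor("META-INF/services/java.sql.Driver")) // concat
println(strategyFor("com/example/Main.class"))            // deduplicate
```

The first matching case wins, exactly as in the build.sbt configuration, which is why more specific patterns must appear before broader ones.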

A Practical Merge Strategy for Spark

The blanket "discard all META-INF" approach works for many projects, but it's too aggressive for some. Service provider files under META-INF/services/ need to be concatenated — they register implementations that libraries discover at runtime. Discarding them silently breaks things like custom filesystem implementations or serialization providers.

Here's a more precise configuration:

// build.sbt
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", _*)  => MergeStrategy.concat
  case PathList("META-INF", _*)              => MergeStrategy.discard
  case "reference.conf"                      => MergeStrategy.concat
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}

What each rule does:

  • META-INF/services/* — Concatenated. These files list service provider implementations. Multiple jars can each contribute implementations of the same interface, and the JVM's ServiceLoader reads all of them. Discarding these can cause ClassNotFoundException at runtime when a library tries to load a provider that was silently dropped.
  • META-INF/* — Discarded. MANIFEST.MF, LICENSE, NOTICE, and DEPENDENCIES are metadata that doesn't affect runtime behavior. The signature files (.SF, .DSA, .RSA) must be discarded: they were computed for the original jars, so carrying them into a repacked fat jar triggers "Invalid signature file digest" errors at startup. The order matters: the services rule must come first since pattern matching evaluates top to bottom.
  • reference.conf — Concatenated. Libraries that use Typesafe Config (like Akka, which Spark uses internally) define defaults in reference.conf. Each library's defaults need to be merged into a single file so ConfigFactory.load() picks them all up.
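To see concretely why concat is right for service descriptors, consider two jars that each ship a META-INF/services/java.sql.Driver file with one provider per line (the driver class names below are real JDBC drivers, used purely for illustration):

```scala
// Each jar contributes a provider-configuration file: one implementation class per line.
val fromJarA = "org.postgresql.Driver\n"
val fromJarB = "com.mysql.cj.jdbc.Driver\n"

// MergeStrategy.concat keeps both registrations in the merged file...
val concatenated = fromJarA + fromJarB

// ...so ServiceLoader-style line-by-line parsing still finds every provider.
val providers: List[String] = concatenated.linesIterator.toList
println(providers)  // List(org.postgresql.Driver, com.mysql.cj.jdbc.Driver)
```

With MergeStrategy.first or discard, one or both registrations would vanish, and ServiceLoader would simply never see the dropped provider.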

Handling Class File Duplicates

Sometimes the conflict isn't in META-INF but in actual .class files. This happens when two dependencies bundle the same class — often shaded or relocated copies of common libraries like Guava or commons-io.

// sbt assembly output
[error] deduplicate: different file contents found in the following:
[error]   .../guava-14.0.1.jar:com/google/common/base/Strings.class
[error]   .../guava-27.0-jre.jar:com/google/common/base/Strings.class

This is a different problem. Two versions of Guava ended up on the classpath, and they contain different bytecode for the same class. Using MergeStrategy.first here silences the error but might cause runtime failures if the wrong version wins.

The better fix is to resolve the version conflict at the dependency level:

// build.sbt
dependencyOverrides += "com.google.guava" % "guava" % "27.0-jre"

dependencyOverrides forces sbt to use a single version of the library across the entire dependency tree. This eliminates the duplicate class before sbt-assembly even sees it.
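To confirm the override took effect, sbt 1.4+ ships dependency-tree tasks (enable them by adding addDependencyTreePlugin to project/plugins.sbt). These are run against your own build, so exact output depends on your dependency graph:

```shell
# Show which dependencies pull in Guava, and which version won resolution
sbt "whatDependsOn com.google.guava guava"

# List every version conflict sbt resolved by eviction
sbt evicted
```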

Use MergeStrategy.first for class conflicts only as a last resort — when you've confirmed the classes are functionally identical or when you can't resolve the version conflict upstream.

Spark with provided Scope

If your Spark dependencies are marked as provided (which they should be for cluster deployments), most of these conflicts go away. The Spark jars and their transitive dependencies aren't included in the fat jar at all — they're supplied by the cluster at runtime.

// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.4.1" % "provided",
)

With provided scope, your fat jar contains only your code and your non-Spark dependencies. This dramatically reduces both jar size and merge conflicts. The merge strategy still matters for conflicts among your remaining dependencies, but you're dealing with far fewer of them.
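One side effect: provided dependencies are absent from the sbt run classpath too, so running locally fails with NoClassDefFoundError. A commonly used workaround is to redefine the run task to include the full compile classpath; verify this recipe against your sbt version:

```scala
// build.sbt
// Optional: let `sbt run` see provided dependencies locally, while still
// excluding them from the fat jar that `sbt assembly` produces.
Compile / run := Defaults.runTask(
  Compile / fullClasspath,
  Compile / run / mainClass,
  Compile / run / runner
).evaluated
```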

Putting It Together

A complete build.sbt for a Spark project that builds a fat jar:

// build.sbt
name := "myproject"

scalaVersion := "2.13.11"
val sparkVersion = "3.4.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "com.lihaoyi" %% "utest" % "0.8.1" % "test",
)

testFrameworks += new TestFramework("utest.runner.Framework")

assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", _*)  => MergeStrategy.concat
  case PathList("META-INF", _*)              => MergeStrategy.discard
  case "reference.conf"                      => MergeStrategy.concat
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}

This handles the vast majority of Spark assembly conflicts. If you hit a new conflict, the error message tells you exactly which files collide — add a targeted rule for that path rather than broadening an existing one.
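For instance, if a future build failure reports two jars colliding on a single top-level properties file (the "git.properties" path below is hypothetical), one narrow rule resolves it without loosening the rest:

```scala
// build.sbt
assembly / assemblyMergeStrategy := {
  case "git.properties"                      => MergeStrategy.first  // hypothetical colliding path
  case PathList("META-INF", "services", _*)  => MergeStrategy.concat
  case PathList("META-INF", _*)              => MergeStrategy.discard
  case "reference.conf"                      => MergeStrategy.concat
  case x =>
    val oldStrategy = (assembly / assemblyMergeStrategy).value
    oldStrategy(x)
}
```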

Debugging Merge Conflicts

When you hit a new conflict and aren't sure which strategy to use, check what the conflicting files actually contain:

// Run from your project root; jar tf only lists entries, so use unzip -p
// to print the conflicting entry's contents from each jar, then compare
unzip -p path/to/first-dependency.jar "conflicting/path" > left.txt
unzip -p path/to/second-dependency.jar "conflicting/path" > right.txt
diff left.txt right.txt

If the files are identical, deduplicate (the default) would have passed — so the files differ. Look at the content to decide:

  • Text files with additive content (service registrations, config) → concat
  • Metadata you don't need (manifests, licenses, signatures) → discard
  • Class files → Fix the version conflict with dependencyOverrides first, fall back to first if needed

Avoid using first or last as a blanket strategy. They silence errors but can hide real problems — like shipping a jar with the wrong version of a transitive class that causes a NoSuchMethodError in production.

Tutorial Details

Created: 2026-03-28 11:07:42 PM

Last Updated: 2026-03-28 11:07:42 PM