
Dependency Confusion Attacks and Your Private Spark Libraries

Five years after Alex Birsan's original dependency confusion research collected more than $130,000 in bug bounties from Apple, Microsoft, PayPal, and Shopify, the same class of supply-chain attack is still landing in JVM builds. Spark Scala teams are an especially easy target — and sbt's default resolver behavior, combined with repository managers that auto-proxy Maven Central, makes the attack surprisingly close to a one-line publish on Sonatype's side.

What the Attack Actually Is

Dependency confusion exploits the way package managers merge results from multiple repositories. If your build looks for an artifact that lives in a private repository, and the resolver is also allowed to see a public repository, a package published to the public repository with the same coordinates and a higher version number can silently win. Your build compiles. Your CI goes green. And a JAR you did not write, from a publisher you do not trust, runs inside your Spark driver.

Birsan's original research was on npm, PyPI, and RubyGems, but the core mechanic applies to any package manager with a fallback to a public registry. Maven Central and sbt are not immune. The reason you hear about npm more often is that npm's namespace is flat and package names are frequently guessable from public JavaScript bundles; Maven's groupId gives an organization slightly more isolation, but "slightly more" is not the same as "safe," and Maven Central's self-service publishing through the Central Portal has made it easier than ever to register a namespace that looks like it belongs to somebody else.

Why Spark Scala Teams Are an Inviting Target

If your team has been running Spark in production for more than a couple of quarters, you almost certainly have internal libraries. A shared UDF collection. A custom source or sink. A helper for your company's schema conventions. Those artifacts are published to a private repository with coordinates that look like com.yourcompany.spark:spark-udfs_2.13:1.4.2.

Three things about that picture make Spark Scala teams particularly exposed:

Coordinates leak. Stack traces in Slack threads, compiled JARs uploaded to a public CI artifact store, dependency graphs in internal documentation that end up in a public wiki — internal artifact coordinates get out in ways most teams do not audit.

Maven Central publishing is self-service. Through the Central Portal, anyone can verify and publish under a groupId tied to a domain or GitHub account they control. com.yourcompany is protected because your organization owns the domain; com.yourcompany-spark or any look-alike variant is not, because an attacker can register that domain and verify it themselves. Namespace squatting on Central is an inexpensive attack for someone willing to wait.

The blast radius is enormous. A malicious artifact that gets pulled into a Spark driver runs with the driver's credentials — the IAM role of an EMR cluster, the service principal of a Databricks job, the Kubernetes service account of a SparkApplication custom resource. That's typically the most privileged identity in a data platform. It can read the whole lakehouse and usually has some write path into production.

This is not a hypothetical. The bounties listed in the original research were paid by some of the largest JVM shops in the world. The attack works.

Why sbt's Defaults Don't Save You

sbt resolves dependencies against a combined list of resolvers called externalResolvers. By default, that list is the union of whatever you declare in resolvers, Maven Central, and the local Ivy repository. Coursier (sbt's default dependency resolver since 1.3) then asks every configured repository, in order, for the coordinates you requested, and uses the first one that responds with a satisfying version.

"First one that responds" sounds safe until you realize that a missing isolated flag, a misordered resolvers :=, or a transitive dependency that was never pinned can all cause the resolver to fall back to Maven Central — and if someone has published com.yourcompany.spark:spark-udfs_2.13:9.9.9 there, the resolver is happy to use it. Version selection across repositories is a "highest wins" conflict resolution strategy in most modern build tools. That is exactly the mechanism dependency confusion abuses.

Here's the naive configuration that leaves you open:

// build.sbt — this is what dependency confusion exploits
resolvers += "Internal" at "https://nexus.yourcompany.com/repository/maven-releases/"

libraryDependencies += "com.yourcompany.spark" %% "spark-udfs" % "1.4.2"

resolvers += appends to the defaults. Maven Central is still in the list. A 9.9.9 on Central wins over your 1.4.2 internal release the moment the resolver has a reason to consider it — which, with transitive dependencies and version conflict resolution, it often does.
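
To make that concrete, here is a hedged sketch of how the confusion typically plays out. The downstream library name and its version range are hypothetical; the mechanism is the one described above.

// build.sbt — hypothetical illustration, not a real dependency graph
libraryDependencies ++= Seq(
  // The internal release you actually want:
  "com.yourcompany.spark" %% "spark-udfs" % "1.4.2",
  // Another internal library (hypothetical name) whose POM requests
  // spark-udfs with an open-ended version range such as [1.0,):
  "com.yourcompany.data" %% "pipeline-core" % "2.3.0"
)

// The range forces the resolver to enumerate available versions of spark-udfs
// across every visible repository. If Maven Central is still in that list and
// an attacker has published com.yourcompany.spark:spark-udfs_2.13:9.9.9 there,
// "highest wins" resolution evicts your 1.4.2 in favor of 9.9.9. The built-in
// evicted task and, on sbt 1.4+, dependencyTree will show the swap before it ships.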

The Mitigations That Actually Work

There are four layers of defense. No single one is sufficient on its own; stack them.

1. Pin Resolution to Your Repository Manager at the Launcher Level

This is the most important control and the one most teams skip. sbt supports a global configuration file at ~/.sbt/repositories that lists the repositories the launcher is allowed to see. Combined with the JVM flag -Dsbt.override.build.repos=true, it forces sbt to ignore any resolvers or externalResolvers defined in build.sbt and use only your repository manager.

# ~/.sbt/repositories
[repositories]
  local
  company-repo: https://nexus.yourcompany.com/repository/maven-public/

# In your CI and developer sbt launcher invocation
sbt -Dsbt.override.build.repos=true compile

This is the single most effective mitigation because it applies regardless of what the build file says. A malicious PR that adds resolvers += "Maven Central" at ... does not bypass it. The repository manager (Nexus, Artifactory, CodeArtifact) proxies Maven Central internally with its own priority rules, so your build still resolves public dependencies — but only through a path you control. If you have not already read our breakdown of private Maven repositories for Spark Scala teams, that's the foundation this mitigation sits on top of.

2. Use Repository Priority Rules on the Manager Side

Every serious repository manager lets you order the repositories inside a "group" or "virtual" repository. The correct order is internal-first, public-last. Additionally, configure a routing rule (Nexus calls these "routing rules"; Artifactory calls them "include/exclude patterns") that refuses to fetch your organization's groupId from Maven Central at all.

In Nexus, this looks like a routing rule that blocks ^/com/yourcompany/.* on the Maven Central proxy. Any attempt to fetch com.yourcompany coordinates from the public proxy returns a hard 404, not a fallback. Artifactory's equivalent is an exclude pattern under the remote repository's advanced settings.

This is the control that stops the attack dead even if layer 1 is somehow bypassed — the resolver cannot obtain a malicious artifact with your coordinates through your infrastructure, period.

3. Lock Down externalResolvers in the Build

For builds where the launcher-level control is not practical (for example, contributors cloning your OSS repository), use externalResolvers directly instead of resolvers +=. This replaces the defaults rather than appending to them.

// build.sbt — explicit resolver list, no Maven Central fallback
externalResolvers := Seq(
  "Internal" at "https://nexus.yourcompany.com/repository/maven-public/"
)

// Or, if you still need Central but want control of the order:
externalResolvers := Seq(
  "Internal" at "https://nexus.yourcompany.com/repository/maven-releases/",
  Resolver.mavenCentral
)

The first form is the one to prefer. If your Nexus group repository proxies Maven Central, you do not need Central as a separate resolver, and leaving it off means a misconfigured Nexus is the only failure mode — not an attacker-controlled Central upload.
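
A cheap complement is to make the build itself assert the resolver allow-list, so a drive-by resolvers += in a pull request fails fast instead of silently widening the search path. This is a minimal sketch, assuming your repository manager lives at nexus.yourcompany.com; the task name is ours, not something sbt provides:

// build.sbt — sketch of a resolver allow-list check (task name and host are assumptions)
lazy val checkResolvers = taskKey[Unit]("Fail if any resolver points outside the internal repository manager")

checkResolvers := {
  val allowedHost = "nexus.yourcompany.com" // assumption: your Nexus/Artifactory host
  val offenders = externalResolvers.value.collect {
    case r: MavenRepository if !r.root.contains(allowedHost) => s"${r.name} -> ${r.root}"
  }
  if (offenders.nonEmpty)
    sys.error("Unexpected resolvers configured: " + offenders.mkString(", "))
}

Running it in CI before the main build, for example as sbt checkResolvers compile, means a resolver change has to be made deliberately rather than slipping in unnoticed.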

4. Pin Transitive Versions for Your Own Artifacts

Dependency confusion most commonly exploits transitive dependencies, because those are the ones developers don't look at. If your internal library is called com.yourcompany.spark:spark-udfs, any project that depends on another internal library which itself depends on spark-udfs pulls it in transitively. Use dependencyOverrides to force the version from your repository manager for every artifact in your groupId, regardless of what version any transitive dependency requests.

// build.sbt — force internal versions for your groupId
dependencyOverrides ++= Seq(
  "com.yourcompany.spark" %% "spark-udfs" % "1.4.2",
  "com.yourcompany.spark" %% "spark-connectors" % "0.7.1",
  "com.yourcompany.schemas" %% "event-schemas" % "3.2.0"
)

dependencyOverrides forces the revision without adding a direct dependency. An attacker who publishes com.yourcompany.spark:spark-udfs_2.13:9.9.9 to Central cannot win a conflict-resolution battle because the override pins to 1.4.2 explicitly.
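
If you want to confirm that the pins actually held after resolution, the resolved module graph can be compared back against the overrides. The sketch below is illustrative only; the task name, the groupId prefix, and treating dependencyOverrides as the source of truth are assumptions rather than anything sbt ships with:

// build.sbt — sketch: fail the build if an internal artifact resolved to an unpinned version
lazy val checkInternalVersions = taskKey[Unit]("Fail if an internal artifact resolved to an unpinned version")

checkInternalVersions := {
  val prefix = "com.yourcompany"                     // assumption: your groupId prefix
  val suffix = "_" + scalaBinaryVersion.value        // %% appends this to published names
  val pinned = dependencyOverrides.value
    .filter(_.organization.startsWith(prefix))
    .map(m => (m.organization, m.name) -> m.revision)
    .toMap
  val drift = update.value.allModules
    .filter(_.organization.startsWith(prefix))
    .flatMap { m =>
      val key = (m.organization, m.name.stripSuffix(suffix))
      pinned.get(key).filter(_ != m.revision).map { expected =>
        s"${m.organization}:${m.name} resolved ${m.revision}, expected $expected"
      }
    }
  if (drift.nonEmpty) sys.error("Internal version drift detected:\n" + drift.mkString("\n"))
}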

A Minimum Viable Configuration

If you do nothing else after reading this, do these three things:

  1. Create ~/.sbt/repositories on your build agents pointing at your Nexus/Artifactory/CodeArtifact group repository, and launch sbt with -Dsbt.override.build.repos=true.
  2. Add a routing rule on the public-repo proxy that blocks your organization's groupId prefix.
  3. Change resolvers += "Internal" at ... to externalResolvers := Seq("Internal" at ...) in every internal Spark project.

That's a one-afternoon change for most teams and it takes dependency confusion from "one motivated attacker away" to "would require a compromise of our repository manager, which is a different and much harder attack."

The Larger Point

Dependency confusion is not a novel vulnerability. It's a consequence of a design decision baked into most build tools: merge results from all configured sources, resolve conflicts by picking the highest version. That decision is great for developer ergonomics and terrible for security when one of those sources is a self-service public registry.

Spark platforms concentrate an unusual amount of privilege in a small number of JVMs. The Spark driver often has read access to the entire lakehouse and write access to the pipelines that feed production dashboards. Treating dependency resolution as a trust boundary — not a convenience — is the cost of running that kind of platform safely. Most teams have not paid that cost yet. It's a small bill to pick up now; it's a very large bill if an attacker picks it up first.

Article Details

Created: 2026-04-24

Last Updated: 2026-04-24 10:54:40 PM