Using spark-submit with Private Maven Dependencies

spark-submit --packages resolves dependencies through Ivy, not sbt. That means none of the credential and resolver setup in your build.sbt carries over — the cluster needs its own configuration. Here's how to wire up --repositories, supply credentials, and decide whether --packages is even the right tool for the job.

Why Local Builds Work and spark-submit Doesn't

Your sbt build resolves an internal artifact like com.acme:spark-data-utils_2.13:1.4.0 cleanly because sbt is configured for your private repository. You build, you test, everything's green. Then you submit:

spark-submit \
  --master yarn \
  --packages com.acme:spark-data-utils_2.13:1.4.0 \
  --class com.acme.MyJob \
  myjob.jar

And resolution dies:

```
:: problems summary ::
:::: WARNINGS
        module not found: com.acme#spark-data-utils_2.13;1.4.0

    ==== local-m2-cache: tried
      file:.../.m2/repository/com/acme/spark-data-utils_2.13/1.4.0/spark-data-utils_2.13-1.4.0.pom

    ==== central: tried
      https://repo1.maven.org/maven2/com/acme/spark-data-utils_2.13/1.4.0/spark-data-utils_2.13-1.4.0.pom

        ::::::::::::::::::::::::::::::::::::::::::::::
        ::          UNRESOLVED DEPENDENCIES         ::
        ::::::::::::::::::::::::::::::::::::::::::::::

        :: com.acme#spark-data-utils_2.13;1.4.0: not found

        ::::::::::::::::::::::::::::::::::::::::::::::
```

spark-submit doesn't read build.sbt, doesn't read ~/.sbt/.credentials, and doesn't know about your internal repository. It runs Ivy as its dependency resolver, with Ivy's defaults — local Maven cache, Maven Central, Spark's packages repo, and that's it.

To fix this, you need to tell Ivy two things: where the internal repository is, and how to authenticate against it.

Step 1: Add the Repository with --repositories

The simplest fix is the --repositories flag. It accepts a comma-separated list of URLs that Ivy will check in addition to its defaults:

spark-submit \
  --master yarn \
  --repositories https://nexus.acme.internal/repository/maven-releases/ \
  --packages com.acme:spark-data-utils_2.13:1.4.0 \
  --class com.acme.MyJob \
  myjob.jar

If your internal repo is unauthenticated (or proxies a network that already filters access), this is enough — Ivy downloads the artifact and ships it to the executors via spark.jars. Most internal repos require authentication, though, and that's where this gets more involved.

Step 2: Supply Credentials via ~/.ivy2/.credentials

When the repository requires auth, the unauthenticated request gets a 401 and resolution fails the same way it would for a missing artifact. Ivy reads credentials from ~/.ivy2/.credentials by default (similar in shape to sbt's credentials file, but a separate file in a separate directory):

# ~/.ivy2/.credentials
realm=Sonatype Nexus Repository Manager
host=nexus.acme.internal
user=jane.doe
password=abc123-personal-token

The same four-field format and the same gotchas apply as for sbt's credentials file:

  • realm must exactly match the realm string the server sends in the WWW-Authenticate header. Find it with curl -v if you're not sure (see the check after this list).
  • host is the hostname only — no scheme, no path, no trailing slash. nexus.acme.internal, not https://nexus.acme.internal/.
  • user and password should be an account name and an API token, not a password. Most repository managers prefer or require tokens for non-browser clients.
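
One way to read the realm off the server, using the Nexus URL from the examples above, is to make an unauthenticated request and inspect the challenge:

# Request the repo without credentials and look at the auth challenge
curl -sv https://nexus.acme.internal/repository/maven-releases/ 2>&1 \
  | grep -i 'www-authenticate'

# Typical output when the repo requires auth:
#   < WWW-Authenticate: BASIC realm="Sonatype Nexus Repository Manager"

The quoted string after realm= is exactly what belongs in the credentials file.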

With the file in place on the same machine running spark-submit, the previous command works:

spark-submit \
  --master yarn \
  --repositories https://nexus.acme.internal/repository/maven-releases/ \
  --packages com.acme:spark-data-utils_2.13:1.4.0 \
  --class com.acme.MyJob \
  myjob.jar

Where Resolution Actually Happens

A subtle but important detail: --packages resolution runs on whichever machine is the driver. In client deploy mode, that's the machine running spark-submit. In cluster deploy mode, the driver runs on the cluster — typically a YARN container, a Kubernetes pod, or a standalone cluster worker.

This matters because:

  • Client mode — Your laptop or edge node resolves the artifacts, then ships them to executors via spark.jars. Only your machine needs network access to the internal repo, and only your machine needs the credentials file.
  • Cluster mode — A driver process spun up somewhere on the cluster resolves the artifacts. That process needs network access to the repo and a credentials file in its own $HOME/.ivy2/.credentials. Whatever launches drivers (the YARN node manager, the Kubernetes pod template, etc.) is responsible for putting the credentials file in place.

If you're hitting "works on my machine, fails when submitted in cluster mode," this is almost always why. The resolution worked locally because your ~/.ivy2/.credentials was readable. The driver container has a different home directory and no credentials.
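
When that happens on YARN, the Ivy resolution report lands in the driver's container logs rather than your terminal. A quick way to confirm, assuming you have the application ID from the submit output (the ID below is illustrative):

# Pull the driver's logs and find the Ivy resolution report
yarn logs -applicationId application_1700000000000_0042 \
  | grep -B 2 -A 8 'problems summary'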

Step 3: Use a Custom ivysettings.xml for Full Control

For anything beyond the simplest case — multiple internal repositories, repository chains, or credentials embedded with the resolver definitions — write a custom ivysettings.xml and point Ivy at it:

<!-- /etc/spark/ivysettings.xml -->
<ivysettings>
  <!-- Make environment variables available as ${env.*} properties -->
  <properties environment="env" />

  <!-- Credentials for the internal host, interpolated from the environment -->
  <credentials host="nexus.acme.internal"
               realm="Sonatype Nexus Repository Manager"
               username="${env.NEXUS_USER}"
               passwd="${env.NEXUS_PASSWORD}" />

  <settings defaultResolver="acme-chain" />

  <resolvers>
    <chain name="acme-chain">
      <!-- Public artifacts still come from Central -->
      <ibiblio name="central" m2compatible="true" />
      <ibiblio name="acme-releases" m2compatible="true"
               root="https://nexus.acme.internal/repository/maven-releases/" />
      <!-- changingPattern plus checkmodified make Ivy re-check snapshots -->
      <ibiblio name="acme-snapshots" m2compatible="true"
               root="https://nexus.acme.internal/repository/maven-snapshots/"
               changingPattern=".*-SNAPSHOT" checkmodified="true" />
    </chain>
  </resolvers>
</ivysettings>
A few things this configuration does that --repositories plus ~/.ivy2/.credentials doesn't:

  • Multiple repositories in a single chain. Releases and snapshots are separate paths on Nexus and Artifactory; both need to be configured for projects that consume snapshots.
  • changingPattern=".*-SNAPSHOT" tells Ivy to re-check snapshot artifacts for updates rather than caching them indefinitely. Without this, a snapshot that was resolved last week stays cached forever.
  • Credentials defined inline with environment variable interpolation. The CI system sets NEXUS_USER and NEXUS_PASSWORD; the file itself contains no secrets and can be checked into version control.

Tell spark-submit to use this settings file via the Spark Ivy settings property:

spark-submit \
  --master yarn \
  --conf spark.jars.ivySettings=/etc/spark/ivysettings.xml \
  --packages com.acme:spark-data-utils_2.13:1.4.0 \
  --class com.acme.MyJob \
  myjob.jar

spark.jars.ivySettings is the supported Spark configuration for this — it's read on the driver before --packages resolution begins, and it replaces Ivy's default settings entirely (so you don't pass --repositories alongside it; the settings file owns the resolver chain).

In cluster mode, the file path must be valid on the driver host, not on the submit host. Either bake it into the cluster's base image, mount it into driver containers, or use --files to ship it from the submit machine and reference it by basename — though spark.jars.ivySettings is read before --files are localized, so the bake-in or mount approach is more reliable.
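
A minimal sketch of the bake-in approach, assuming a Docker-based cluster and the apache/spark base image (tag illustrative):

# Build a Spark image that carries the settings file at the expected path
cat > Dockerfile <<'EOF'
FROM apache/spark:3.4.1
COPY ivysettings.xml /etc/spark/ivysettings.xml
EOF
docker build -t spark-with-ivysettings:3.4.1 .

Every driver container started from this image then satisfies the /etc/spark/ivysettings.xml path used in the spark-submit command above.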

CI/CD: Writing the Credentials File at Submit Time

For pipelines that submit Spark jobs, the cleanest pattern is to materialize the credentials file from secrets manager values just before invoking spark-submit:

# In a CI job script
mkdir -p ~/.ivy2

cat > ~/.ivy2/.credentials <<EOF
realm=Sonatype Nexus Repository Manager
host=nexus.acme.internal
user=$NEXUS_USER
password=$NEXUS_PASSWORD
EOF

chmod 600 ~/.ivy2/.credentials

spark-submit \
  --master yarn \
  --repositories https://nexus.acme.internal/repository/maven-releases/ \
  --packages com.acme:spark-data-utils_2.13:1.4.0 \
  --class com.acme.MyJob \
  myjob.jar

The chmod 600 matters — Ivy doesn't enforce restrictive permissions, but leaving a credentials file world-readable on a shared CI runner is exactly the kind of mistake that ends up in a postmortem.

For pipelines that run jobs in cluster mode, the credentials file needs to land on the driver host, not the CI runner. The cleanest version of this is usually a Kubernetes secret mounted as ~/.ivy2/.credentials on the driver pod, or a YARN node-manager-provided file that gets symlinked into place when a container starts.
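
A rough sketch of the Kubernetes variant, assuming Spark on Kubernetes with an image whose entrypoint copies the mounted file into $HOME/.ivy2/ before the driver JVM starts (hostnames and paths illustrative):

# Store the four-field credentials file as a Kubernetes secret
kubectl create secret generic ivy-credentials \
  --from-file=.credentials=ivy.credentials

# spark.kubernetes.driver.secrets.<name> mounts the secret into the driver.
# Secret volumes are read-only, so it lands at a staging path rather than
# directly over ~/.ivy2, which Ivy also writes its cache into.
spark-submit \
  --master k8s://https://k8s.acme.internal:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.secrets.ivy-credentials=/etc/ivy-creds \
  --repositories https://nexus.acme.internal/repository/maven-releases/ \
  --packages com.acme:spark-data-utils_2.13:1.4.0 \
  --class com.acme.MyJob \
  local:///opt/spark/app/myjob.jar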

Why a Fat Jar Is Often the Better Answer

Configuring --packages against a private repository works, but it's not the only option, and for production deployments it's often not the best one. The alternative is to bake your internal dependencies into the application jar with sbt assembly:

// build.sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"        % "3.4.1" % "provided",
  "com.acme"         %% "spark-data-utils" % "1.4.0",
)
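
The assembly task itself comes from the sbt-assembly plugin; a minimal wiring, with a merge strategy that discards the usual META-INF conflicts (plugin version illustrative; use the current release):

// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// build.sbt, alongside the dependencies above
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", _*) => MergeStrategy.discard
  case _                        => MergeStrategy.first
}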

Then submit just the assembled jar:

spark-submit \
  --master yarn \
  --class com.acme.MyJob \
  myjob-assembly-0.1.0.jar

Why this is usually better:

  • The cluster doesn't need credentials. Your CI machine resolved the dependencies once during the build. The cluster sees a single self-contained jar.
  • No surprise resolution at submit time. The version of spark-data-utils that runs in production is exactly the one your CI tested with — it can't drift because Ivy resolved a slightly different transitive dependency.
  • Submission is faster. No Ivy resolution delay, no jar download from your driver to your executors, no failure modes related to flaky network paths between the cluster and your repository.

--packages is genuinely useful for ad-hoc work — exploring a notebook with a one-off library, testing a new version of an internal lib without rebuilding your application jar — but for scheduled production workloads, the fat jar approach removes a whole category of failure modes.

If your internal library uses provided scope for its Spark dependencies (it should), the fat jar is small enough that this tradeoff is essentially free.

Verifying Resolution End-to-End

If you do go the --packages route, verify the full path before scheduling jobs against it. Run a no-op submit that just resolves and prints the classpath:

spark-submit \
  --master yarn \
  --conf spark.jars.ivySettings=/etc/spark/ivysettings.xml \
  --packages com.acme:spark-data-utils_2.13:1.4.0 \
  --verbose \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_*.jar 10

--verbose prints the resolved jar list before the application starts. You're looking for a line like:

Resolved jars: file:/.../jars/com.acme_spark-data-utils_2.13-1.4.0.jar, ...

If your artifact appears in that list, resolution worked. If you instead see UNRESOLVED DEPENDENCIES, the most common causes are:

  • Wrong realm string. Ivy doesn't send the password if the realm doesn't match, and the resulting 401 looks identical to a "not found" error. The curl check after this list distinguishes the two.
  • --repositories URL has a typo or wrong path. Nexus and Artifactory both have multiple paths under the same host (maven-releases, maven-snapshots, maven-public). Pasting the wrong one is a common Friday-afternoon mistake.
  • The credentials file is on the wrong machine. In cluster mode, the driver is what reads it — not the submit host.
  • Ivy is using a stale failed-resolution cache. When Ivy fails to resolve an artifact, it caches that failure. Clear it with rm -rf ~/.ivy2/cache/com.acme and retry.
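
A quick way to rule out the first two causes is to request the POM directly with curl, bypassing Ivy entirely:

# 200 means the URL and credentials are both right; 401 points at the
# credentials or realm; 404 points at the repository path or coordinates
curl -u "$NEXUS_USER:$NEXUS_PASSWORD" -I \
  https://nexus.acme.internal/repository/maven-releases/com/acme/spark-data-utils_2.13/1.4.0/spark-data-utils_2.13-1.4.0.pom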

Summary

The minimum viable setup for spark-submit --packages against a private Maven repo is:

  1. --repositories <repo-url> to tell Ivy where to look.
  2. ~/.ivy2/.credentials on the driver host with matching realm and host.
  3. Awareness of which machine is the driver in your deploy mode.

For anything more than the basic case — multiple repos, snapshots, CI/CD — promote the configuration to a custom ivysettings.xml referenced via spark.jars.ivySettings. And before adopting --packages for production jobs, weigh whether a fat jar built with sbt assembly would remove the whole problem.
