
Spark Connect for Scala: Building Thin-Client Applications

Spark Connect decouples the application from the cluster with a gRPC protocol, and as of Spark 4.0 the Scala client has near-complete DataFrame and Dataset API parity with classic mode. Here's the architecture, how to wire it up from sbt, and what still doesn't work.

What Spark Connect Actually Is

Classic Spark embeds the driver inside your application JVM. Your code, the Catalyst optimizer, the scheduler, and the runtime that talks to executors all run in the same process. That's how Spark has worked since day one, and it's the reason a "Spark application" has historically meant a fat JVM that holds the whole world.

Spark Connect splits that JVM in two. The driver stays on the cluster as a long-running server. Your application becomes a thin client that translates DataFrame operations into unresolved logical plans, serializes them as protocol buffers, and ships them over gRPC to the server. Results come back as Apache Arrow record batches.

The wire protocol is Spark's own logical plan abstraction, which is why the same server can be driven from Python, Scala, Go, Swift, or Rust clients. The client doesn't need a Spark runtime — it just needs to speak the protocol.
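
To make the split concrete, here is a minimal sketch of what that division of labor looks like from client code. The table and column names are made up, and it assumes a Connect SparkSession named spark, created as shown later in this article. Everything before the action only builds a plan on the client; the action is what ships the plan to the server.

import org.apache.spark.sql.functions.col

// These lines only build an unresolved logical plan on the client;
// no RPC happens yet and nothing runs on the cluster.
val orders = spark.table("orders")          // hypothetical table name
  .filter(col("amount") > 100)
  .groupBy(col("customer_id"))
  .count()

// The action serializes the plan as protocol buffers, sends it over gRPC,
// and decodes the Arrow record batches that come back into rows.
val rows = orders.collect()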

What Changed in Spark 4.0

Spark Connect has existed since Spark 3.4, but for Scala developers it was a non-starter until 4.0. The Scala client lagged the Python client significantly: missing Dataset support, missing observability hooks, no mergeInto, no groupingSets. If you wrote Scala, you stayed on classic mode.

That changed with 4.0. Databricks summarized the state of the Scala client this way in the Spark 4.0 announcement: "all Spark SQL features offer near-complete compatibility between Spark Connect and Classic execution mode, with only minor differences remaining." The Scala client now covers:

  • The full Dataset and DataFrame API
  • Column operations and functions
  • User-Defined Functions (UDFs)
  • Catalog and KeyValueGroupedDataset
  • The majority of the Structured Streaming API — DataStreamReader, DataStreamWriter, StreamingQuery, StreamingQueryListener

The practical result: if your application speaks the DataFrame API, you can probably move it to Spark Connect without rewriting business logic.
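
As a quick illustration of that parity, a typed Dataset snippet written for classic Spark compiles and runs unchanged against a Connect session. This is a hedged sketch: the case class and values are invented, and it assumes a Connect SparkSession named spark, created as shown in the session section below.

import spark.implicits._

final case class Order(id: Long, amount: Double)

// Local rows are embedded in the logical plan and shipped to the server
val orders = Seq(Order(1, 12.5), Order(2, 80.0), Order(3, 5.0)).toDS()

// Column-based operations translate directly into the plan
orders.filter($"amount" > 10.0).show()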

Starting a Connect Server

The server ships with the standard Spark distribution. Extract Spark and run the launcher:

tar -xvf spark-4.0.0-bin-hadoop3.tgz
cd spark-4.0.0-bin-hadoop3

./sbin/start-connect-server.sh

The server listens on port 15002 by default. In production this would run on the cluster — on Kubernetes as a pod, on YARN as a long-running application, or on a dedicated VM — and survive across many short-lived client sessions.

The Scala Client: sbt Setup

The client library is separate from spark-sql and spark-core. Add it as a regular dependency:

libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "4.0.0"

That's the only Spark dependency your client application needs. There's no spark-core, no spark-sql, no Hadoop client, no shaded Guava — all of that lives on the server. The client jar pulls in gRPC, Arrow, and a small amount of Spark code for plan construction.

This is the operational win that matters: your application's classpath is no longer dominated by Spark and its transitive dependency tree. Dependency conflicts between your code and Spark's bundled libraries become a non-issue, because Spark isn't on your classpath.
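
For reference, a minimal build.sbt for such a client might look like the sketch below. The project name is arbitrary, and Spark 4.0 artifacts are published for Scala 2.13.

// build.sbt: a minimal thin-client project (sketch)
name := "connect-client-app"
scalaVersion := "2.13.16"

// The only Spark artifact on the classpath; gRPC and Arrow come in transitively
libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "4.0.0"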

Creating a Session

Instead of SparkSession.builder().master("...").getOrCreate(), the Connect client uses a remote URL:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .remote("sc://spark-connect.internal:15002")
  .getOrCreate()

spark.range(10).count()
// 10

The sc:// scheme tells the builder to construct a Connect client rather than a classic embedded driver. Once you have the session, every DataFrame call you'd write against classic Spark works the same way.
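
For example, a read-transform-write sketch like the one below runs identically under Connect; the only difference from classic mode is how the session above was built. The paths and column names are hypothetical.

import org.apache.spark.sql.functions.col

// The server performs the actual I/O, so these paths are resolved cluster-side
val events = spark.read
  .option("header", "true")
  .csv("s3a://example-bucket/raw/events.csv")

events
  .filter(col("status") === "completed")
  .write
  .mode("overwrite")
  .parquet("s3a://example-bucket/curated/events")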

For deployments that need auth or TLS, the connection string carries those as parameters:

// Set this in deployment config, not in code
// export SPARK_REMOTE="sc://spark-connect.prod:443/;token=..."

val spark = SparkSession.builder().getOrCreate()
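
The same parameters can also be spelled out directly in the builder if you prefer explicit configuration over an environment variable. A hedged sketch with placeholder host and token values:

// Placeholder host and token; token-based auth goes over a TLS channel
val spark = SparkSession.builder()
  .remote("sc://spark-connect.prod:443/;use_ssl=true;token=YOUR_TOKEN")
  .getOrCreate()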

Spark 4.0 Also Adds spark.api.mode

For codebases that want to support both classic and Connect modes from the same source, Spark 4.0 introduced the spark.api.mode configuration (https://www.databricks.com/blog/introducing-apache-spark-40). Set it to connect and the session builder produces a Connect session; leave it as the default and you get classic.

// Same code, mode chosen at launch time. The setting has to be in place
// before the session is created, e.g. via spark-submit --conf spark.api.mode=connect
// or as builder config:
val spark = SparkSession.builder()
  .config("spark.api.mode", "connect")
  .getOrCreate()

This is useful for libraries that need to work in both worlds without if/else branches at the session boundary.

What Still Doesn't Work

Spark Connect is not a 1:1 replacement. There are deliberate omissions that you need to know about before committing.

No RDD API

The RDD API is explicitly unsupported, and that isn't going to change. The RDD model assumes the driver and executors share a JVM and can ship closures freely; the Connect protocol can't represent that. If your code calls spark.sparkContext.parallelize(...), rdd.mapPartitions(...), or anything that drops down to RDDs, Connect won't run it.

For most modern applications this isn't a problem — DataFrame is the primary API and has been for years. But legacy ETL code that mixes RDD operations with DataFrames will need refactoring.

No SparkContext

The client never gets a SparkContext. Anything that reads cluster-wide static configuration through sc.getConf, registers a listener via sc.addSparkListener, or accesses the executor environment directly is unavailable. Spark UI metrics, executor logs, and listener-based observability move server-side.

Streaming Has Gaps

Structured Streaming is supported but not at full parity. The standard read/write API works (readStream, writeStream, start, awaitTermination), and StreamingQueryListener is available. The newer state APIs — particularly transformWithState and flatMapGroupsWithState — are partially supported but have edge cases that you should validate against your specific workload. If you're running complex stateful streams, test on Connect before committing.

UDFs Need Classpath Setup

UDFs work, but the server needs the bytecode of any classes your UDF closes over. The client uploads compiled artifacts to the server through spark.addArtifact(...). For standalone UDF code in a REPL or batch job, register a class finder against your build output directory:

import org.apache.spark.sql.connect.client.REPLClassDirMonitor

// Watch the local compile output so freshly compiled UDF classes sync to the server
val classFinder = new REPLClassDirMonitor("/absolute/path/to/build/classes")
spark.registerClassFinder(classFinder)

// Upload a jar the UDF depends on at execution time
spark.addArtifact("/absolute/path/to/jar/dependency.jar")
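
With the class finder or artifact upload in place, the UDF itself is ordinary Scala. A hedged sketch with made-up table and column names:

import org.apache.spark.sql.functions.{col, udf}

// The lambda's compiled bytecode lives in your build output (or fat jar),
// which is what the class finder / addArtifact call exposes to the server
val normalize = udf((s: String) => if (s == null) "" else s.trim.toLowerCase)

spark.table("raw_events")                   // hypothetical table
  .withColumn("event_name", normalize(col("event_name")))
  .show(5)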

In an sbt-assembly workflow, artifact upload usually means shipping your fat jar to the server before running the UDF-bearing query.

Why Bother

Apart from the API parity in 4.0, there are concrete operational reasons to consider Connect.

Client isolation. A bug in your application — an OOM, an infinite loop, a leaked thread — can't take down the Spark driver, because the driver is on a different machine. Conversely, a server-side issue doesn't crash your application; you reconnect and try again.

Lighter deployments. A client without spark-core is a normal JVM application. It builds faster, deploys faster, and starts in seconds rather than the tens of seconds it takes a classic Spark driver to initialize. Think Lambda-style functions, request-driven services that issue ad-hoc Spark queries, or short-lived CLI tools.

Shared clusters with fair isolation. Multi-tenant platforms — notebook environments, internal data tools, analyst-facing services — can route many concurrent users into a single Spark Connect server (or a small pool) and have each session isolated at the protocol level. No more "one notebook OOM'd and killed the cluster."

Language-agnostic protocol. A single Spark cluster can serve Scala batch jobs, Python notebooks, Go services, and Rust CLI tools without each language needing its own embedded Spark runtime. The Scala API is the most mature, but you're no longer locked to JVM-only consumers.

Easier IDE and tooling integration. Because the client is a regular JVM library, attaching a debugger, profiling allocations, or instrumenting calls is normal Java tooling — not "exec into a Spark driver pod and hope".

When Connect Is the Wrong Choice

Connect is not a free upgrade. There are cases where classic mode is still the right answer.

Latency-sensitive interactive workloads. The gRPC round-trip adds a small per-call cost. For tight loops of small operations (think interactive REPL exploration on tiny datasets), the difference is noticeable.

Workloads that depend on RDD operations. Migration to DataFrame is usually the right answer, but if you have a large body of working RDD code, switching to Connect means rewriting it first.

Tightly coupled execution-side extensions. Custom Partitioner implementations, custom RDD subclasses, deep Catalyst extensions — anything that assumes you can run arbitrary JVM code in the driver — needs to move server-side. That's not impossible, but it's not a drop-in change either.

Existing production pipelines on classic that aren't broken. "We should move to Connect" is rarely a strong enough reason to change a working batch ETL. The benefits are real for new applications and multi-tenant platforms; for a nightly job that runs on YARN and produces a Parquet table, classic remains fine.

A Migration Checklist

If you're evaluating Spark Connect for a Scala application, work through these in order:

  1. Audit for RDD usage. Grep the codebase for sparkContext, .rdd, parallelize, and mapPartitions on RDD[T]. Refactor what you find to DataFrame equivalents.
  2. Check streaming features. If you use transformWithState, custom state encoders, or rely on specific foreachBatch behavior, validate them on Connect against your real workload.
  3. Identify UDF dependencies. List every UDF and the classes it touches. Plan how those artifacts get shipped to the server.
  4. Prototype the connection. Stand up a Connect server, swap master("local[*]") for remote("sc://...") in a test session, and run your job's read/transform/write path end-to-end.
  5. Measure latency. Compare wall-clock time for representative queries on classic vs Connect. Most jobs see no meaningful difference; very small interactive queries see a small constant overhead.
  6. Plan the rollout. Treat the Connect server as a stateful production service — it needs monitoring, restarts, version upgrades, and capacity planning, just like any long-running cluster component.

What's Next

If you're maintaining Spark Scala applications today, Spark Connect is finally worth a serious look. The Spark 4.0 release closed the API gap that made it impractical for Scala-first shops, and the operational benefits — client isolation, dependency cleanliness, multi-tenant patterns — solve real problems that are awkward to solve with classic mode.

The natural next step is to read What's New in Spark 4.0 for Scala Developers (/latest/2026/03/16/whats-new-spark-4-scala/) for the full Scala-relevant feature set, and the Spark 3 to Spark 4 upgrade guide (/latest/2026/03/17/upgrading-spark-3-to-4/) if you're not yet on a Connect-capable version. The official Spark Connect overview (https://spark.apache.org/docs/latest/spark-connect-overview.html) has the up-to-date list of supported APIs and connection-string options.