Spark Connect for Scala: Building Thin-Client Applications
Spark Connect decouples the application from the cluster with a gRPC protocol, and as of Spark 4.0 the Scala client has near-complete DataFrame and Dataset API parity with classic mode. Here's the architecture, how to wire it up with sbt, and what still doesn't work.
What Spark Connect Actually Is
Classic Spark embeds the driver inside your application JVM. Your code, the Catalyst optimizer, the scheduler, and the runtime that talks to executors all run in the same process. That's how Spark has worked since day one, and it's the reason a "Spark application" has historically meant a fat JVM that holds the whole world.
Spark Connect splits that JVM in two. The driver stays on the cluster as a long-running server. Your application becomes a thin client that translates DataFrame operations into unresolved logical plans, serializes them as protocol buffers, and ships them over gRPC to the server. Results come back as Apache Arrow record batches.
The wire protocol is Spark's own logical plan abstraction, which is why the same server can be driven from Python, Scala, Go, Swift, or Rust clients. The client doesn't need a Spark runtime — it just needs to speak the protocol.
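A minimal sketch of that round trip from the client's side, assuming a Connect session named spark already exists (session setup is covered below) and a placeholder parquet path:

import org.apache.spark.sql.functions.col

// Transformations run entirely in the client: each call just extends
// an unresolved logical plan. No gRPC traffic happens yet.
val counts = spark.read.parquet("/data/events")
  .filter(col("status") === "ok")
  .groupBy(col("region"))
  .count()

// The action is the round trip: the plan is serialized as protobuf,
// executed by the remote driver, and rows stream back as Arrow
// record batches.
counts.show()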
What Changed in Spark 4.0
Spark Connect has existed since Spark 3.4, but for Scala developers it was a non-starter until 4.0. The Scala client lagged the Python client significantly: missing Dataset support, missing observability hooks, no mergeInto, no groupingSets. If you wrote Scala, you stayed on classic mode.
That changed with 4.0. Databricks summarized the state of the Scala client this way in the Spark 4.0 announcement: "all Spark SQL features offer near-complete compatibility between Spark Connect and Classic execution mode, with only minor differences remaining." The Scala client now covers:
The full Dataset and DataFrame API
Column operations and functions
User-Defined Functions (UDFs)
Catalog and KeyValueGroupedDataset
The majority of the Structured Streaming API — DataStreamReader, DataStreamWriter, StreamingQuery, StreamingQueryListener
The practical result: if your application speaks the DataFrame API, you can probably move it to Spark Connect without rewriting business logic.
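A quick illustration of that parity, again assuming a Connect session named spark (setup below). The UDF is deliberately trivial; the point is that its closure is serialized on the client and executed on the server:

import org.apache.spark.sql.functions.{col, udf}

// The UDF body never runs in the client process.
val squared = udf((x: Long) => x * x)

spark.range(5)
  .withColumn("sq", squared(col("id")))
  .show()

One caveat for compiled applications: the server needs the bytecode for your closures, so you may have to upload the application jar with spark.addArtifact("target/scala-2.13/myapp.jar") — the path here is a placeholder.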
Starting a Connect Server
The server ships with the standard Spark distribution. Extract Spark and run the launcher:
tar -xvf spark-4.0.0-bin-hadoop3.tgz
cd spark-4.0.0-bin-hadoop3
./sbin/start-connect-server.sh
The server listens on port 15002 by default. In production this would run on the cluster — on Kubernetes as a pod, on YARN as a long-running application, or on a dedicated VM — and survive across many short-lived client sessions.
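The launcher forwards spark-submit-style flags, so server-side settings go on the command line. For example, binding a different port (config key per the Spark Connect server documentation):

./sbin/start-connect-server.sh \
  --conf spark.connect.grpc.binding.port=16002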
The Scala Client: sbt Setup
The client library is separate from spark-sql and spark-core. Add it as a regular dependency:
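In build.sbt, the artifact is spark-connect-client-jvm; keep the client version in step with the server:

libraryDependencies += "org.apache.spark" %% "spark-connect-client-jvm" % "4.0.0"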
That's the only Spark dependency your client application needs. There's no spark-core, no spark-sql, no Hadoop client, no shaded Guava — all of that lives on the server. The client jar pulls in gRPC, Arrow, and a small amount of Spark code for plan construction.
This is the operational win that matters: your application's classpath is no longer dominated by Spark and its transitive dependency tree. Dependency conflicts between your code and Spark's bundled libraries become a non-issue, because Spark isn't on your classpath.
Creating a Session
Instead of SparkSession.builder().master("...").getOrCreate(), the Connect client uses a remote URL:
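A minimal sketch, assuming the local server started above on its default port:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .remote("sc://localhost:15002")
  .getOrCreate()

// Sanity check: this executes on the remote driver, not in this JVM.
println(spark.range(1000).count())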
The sc:// scheme tells the builder to construct a Connect client rather than a classic embedded driver. Once you have the session, every DataFrame call you'd write against classic Spark works the same way.
For deployments that need auth or TLS, the connection string carries those as parameters:
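For example, with a placeholder host and token (parameter names per the Spark Connect connection-string spec):

val spark = SparkSession.builder()
  .remote("sc://spark.example.com:443/;use_ssl=true;token=YOUR_TOKEN")
  .getOrCreate()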