
DuckDB for Spark Scala Developers: What You Need to Know

DuckDB is an embedded, in-process columnar OLAP engine that runs inside your application — no cluster, no JVM serialization tax, sub-second startup. For Spark Scala developers, the entry point is the DuckDB JDBC driver on Maven Central, and the Scala 3 duck4s wrapper if you want a more idiomatic API. This is not a Spark replacement — it's a complementary tool that fits the gaps Spark was never designed to fill.

What DuckDB Actually Is

DuckDB is to OLAP what SQLite is to OLTP. It's a single-binary, embedded analytics database with no server, no daemon, and no cluster. You add a JAR to your classpath, open a JDBC connection, and you have a vectorized columnar query engine running inside your JVM process.

The engine is written in C++ with SIMD-vectorized execution and a Postgres-flavored SQL dialect. It reads Parquet, CSV, JSON, and Iceberg directly. It can query files on the local filesystem, on S3, or stream from a URL. It supports MERGE, window functions, CTEs, and most of the SQL surface area Spark users would expect.

What it doesn't do: distributed execution, streaming, fault-tolerant shuffles, or anything that requires a cluster. That's the whole point — it stays in one process.

Where DuckDB and Spark Don't Compete

Spark's strength is distributed compute. Once your data outgrows a single machine — or your job needs fault tolerance across a long shuffle, or you're writing a stateful streaming pipeline — Spark is the right tool and there is no real substitute in the open-source world.

DuckDB's strength is the opposite end of the spectrum. For datasets that comfortably fit on one machine, the cluster overhead Spark carries — driver setup, executor coordination, shuffle bookkeeping — becomes pure tax. A Spark job that takes 90 seconds because of cluster startup can finish in two seconds in DuckDB on the same machine. Not because DuckDB is "faster than Spark" in some absolute sense, but because it skips the parts of Spark you weren't using.

The honest framing: DuckDB wins on small data, Spark wins on big data, and the boundary depends on hardware, workload, and how much you care about cluster startup time. For most teams running both, the practical sweet spots look like:

  • Under ~10GB on one machine: DuckDB is almost always faster end-to-end.
  • 10–100GB: Either can work. Depends on whether your machine has the RAM and how much fault tolerance you need.
  • Hundreds of GB and up, or anything stateful streaming: Spark, every time.

This article assumes you already use Spark in production. The interesting question is what DuckDB gives you that Spark doesn't, and where to slot it into a stack you've already built.

The JDBC Entry Point

Everything DuckDB exposes to the JVM goes through duckdb_jdbc, published on Maven Central. As of this writing the current version is 1.5.2.

// build.sbt
libraryDependencies += "org.duckdb" % "duckdb_jdbc" % "1.5.2"

The driver implements JDBC 4.1, so anything that takes a java.sql.Connection works. In-memory databases use jdbc:duckdb:; file-backed databases use jdbc:duckdb:/path/to/file.

import java.sql.DriverManager

// Explicit driver load — optional on JDBC 4+ (the ServiceLoader auto-registers it), but harmless
Class.forName("org.duckdb.DuckDBDriver")

// In-memory database — discarded when the connection closes
val conn = DriverManager.getConnection("jdbc:duckdb:")

val stmt = conn.createStatement()
val rs = stmt.executeQuery("SELECT count(*) FROM 's3://my-bucket/orders/*.parquet'")
rs.next()
println(s"row count: ${rs.getLong(1)}")

rs.close()
stmt.close()
conn.close()

That single query reads Parquet files directly off S3 with no schema declaration, no catalog setup, and no cluster. (Recent DuckDB versions autoload the httpfs extension for s3:// paths; you still need to supply S3 credentials, for example via a CREATE SECRET statement or environment variables.) DuckDB infers the schema from the file footers and pushes predicates and projections down automatically.

For Spark Scala teams, the JDBC interface is enough to get real work done. The driver is small (a few MB), the API is the one you already know, and there's no second runtime to operate.
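When you want results to outlive the connection, point the same API at a file. A minimal sketch of the file-backed flow — the scratch path, table, and values here are illustrative:

```scala
import java.nio.file.Files
import java.sql.DriverManager

// Illustrative scratch path; any filesystem path works.
val dbPath = Files.createTempDirectory("duckdb-demo").resolve("analytics.duckdb").toString

// First process: create the database file and write a table.
val writer = DriverManager.getConnection(s"jdbc:duckdb:$dbPath")
val wstmt  = writer.createStatement()
wstmt.execute("CREATE TABLE daily_totals (day DATE, total BIGINT)")
wstmt.execute("INSERT INTO daily_totals VALUES (DATE '2026-05-08', 42)")
writer.close()

// Later (or another) process: reopen the same file and read it back.
val reader = DriverManager.getConnection(s"jdbc:duckdb:$dbPath")
val rs = reader.createStatement().executeQuery("SELECT sum(total)::BIGINT FROM daily_totals")
rs.next()
val total = rs.getLong(1)
reader.close()
```

One caveat: a DuckDB file supports a single writing process at a time; additional processes should open it read-only. It's a hand-off format, not a shared server.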

A More Scala-Native Option: duck4s

Raw JDBC works, but it's also raw JDBC — manual resource management, checked exceptions, mutable result sets. If you're on Scala 3, duck4s wraps the driver in something that feels like Scala.

// build.sbt — Scala 3 only
libraryDependencies += "com.softinio" %% "duck4s" % "0.1.4"

// Optional cats-effect integration with Resource and fs2.Stream
libraryDependencies += "com.softinio" %% "duck4s-cats-effect" % "0.1.4"

It uses Either[DuckDBError, T] for error handling, has withConnection and withPreparedStatement blocks for automatic resource cleanup, and supports for-comprehension composition.

import com.softinio.duck4s._

val result = withConnection("jdbc:duckdb:") { conn =>
  for {
    _    <- conn.execute("CREATE TABLE events (id INT, payload VARCHAR)")
    _    <- conn.execute("INSERT INTO events VALUES (1, 'login'), (2, 'click')")
    rows <- conn.query("SELECT id, payload FROM events ORDER BY id")
  } yield rows.map(r => (r.getInt(1), r.getString(2)))
}

If you're maintaining a Spark Scala codebase on Scala 2.13, duck4s isn't an option — it's Scala 3 only. The JDBC driver is. For Scala 3 services or tooling that sits alongside your Spark pipelines, duck4s is worth the import line.

Where DuckDB Fits Around a Spark Stack

Here's the part most "DuckDB vs Spark" articles miss: in a real production environment, you probably want both. Spark for the distributed lift, DuckDB for everything where spinning up a cluster would be silly.

CI and unit tests. Tests that need to assert behavior on Parquet, query layouts, or SQL semantics can use DuckDB to run in seconds rather than starting a local SparkSession. DuckDB reads the same Parquet files Spark writes, so your test fixtures don't have to change. For pure DataFrame logic you'll still want a Spark test, but for Parquet-shape assertions, DuckDB is faster.
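As a sketch of what such a test looks like — here DuckDB itself generates the Parquet fixture so the example is self-contained; in a real suite the file would be your Spark job's output:

```scala
import java.nio.file.Files
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:duckdb:")
val stmt = conn.createStatement()

// Stand-in fixture: in a real test this Parquet file comes from Spark.
val fixture = Files.createTempDirectory("ci-demo").resolve("orders.parquet").toString
stmt.execute(s"COPY (SELECT range AS order_id FROM range(100)) TO '$fixture' (FORMAT PARQUET)")

// The assertion that used to need a SparkSession: no duplicate keys.
val rs = stmt.executeQuery(s"SELECT count(*), count(DISTINCT order_id) FROM '$fixture'")
rs.next()
val (rows, distinct) = (rs.getLong(1), rs.getLong(2))
assert(rows == distinct, "duplicate order_id values in output")
conn.close()
```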

Local exploration of pipeline output. Your Spark job lands a 500MB Parquet file in S3 and someone wants to know whether the join produced duplicates. You can spin up a notebook against a remote cluster, or you can SELECT count(*), count(distinct id) FROM 's3://...' from DuckDB on your laptop in two seconds. The second option is almost always the right move.

BI and dashboard backends. When the data fits in memory, DuckDB outperforms cluster-based systems for low-latency analytical queries. Pre-aggregate with Spark, drop the result into a DuckDB file or an in-memory database, and serve it. Many teams that previously deployed a Presto or Trino cluster purely for the BI layer have collapsed that into a DuckDB-backed service.
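A sketch of that serving pattern, with the pre-aggregated rows inlined so the example stands alone — in practice the CREATE TABLE would read Spark-produced Parquet from S3, and the table and column names are illustrative:

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:duckdb:")
val stmt = conn.createStatement()

// In production: CREATE TABLE daily_revenue AS SELECT * FROM 's3://.../agg.parquet';
// here we inline two rows so the sketch is self-contained.
stmt.execute(
  "CREATE TABLE daily_revenue AS " +
  "SELECT * FROM (VALUES (DATE '2026-05-07', 120), (DATE '2026-05-08', 180)) t(day, revenue)"
)

// A typical dashboard query, answered in-process with no cluster round-trip.
val rs = stmt.executeQuery("SELECT sum(revenue)::BIGINT FROM daily_revenue")
rs.next()
val total = rs.getLong(1)
conn.close()
```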

Lightweight ETL. Not every data pipeline needs a cluster. If you're moving 5GB of CSVs into Parquet on a schedule, DuckDB's COPY ... TO ... (FORMAT PARQUET) does the job in one statement. You don't need Spark for that, and Spark's startup cost makes it the wrong choice when the actual transformation takes less than a minute.
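A sketch of that one-statement conversion through JDBC — the input CSV is generated inline here so the example is self-contained; real paths would point at your data:

```scala
import java.nio.file.Files
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:duckdb:")
val stmt = conn.createStatement()

// Illustrative input: generate a small CSV so the sketch runs on its own.
val dir = Files.createTempDirectory("etl-demo")
val csv = dir.resolve("events.csv").toString
val out = dir.resolve("events.parquet").toString
stmt.execute(s"COPY (SELECT range AS id, 'click' AS kind FROM range(10)) TO '$csv' (FORMAT CSV, HEADER)")

// The entire "ETL job": one COPY statement, schema inferred from the CSV.
stmt.execute(s"COPY (SELECT * FROM read_csv_auto('$csv')) TO '$out' (FORMAT PARQUET)")

val rs = stmt.executeQuery(s"SELECT count(*) FROM '$out'")
rs.next()
val n = rs.getLong(1)
conn.close()
```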

Iceberg interop. DuckDB can read Iceberg tables (with the iceberg extension), so if you're using Iceberg as your table format, you can query the same tables from both engines depending on the workload. Spark for writes and large queries; DuckDB for ad-hoc reads against a snapshot.
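The setup is two statements plus a table-function scan — a sketch that assumes the extension installs cleanly in your environment and that the path points at a real Iceberg table (both are assumptions here):

```sql
INSTALL iceberg;  -- one-time per environment; fetches the extension
LOAD iceberg;

-- Illustrative path: the Iceberg table's root location.
SELECT count(*)
FROM iceberg_scan('s3://my-bucket/warehouse/db/orders');
```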

What DuckDB Won't Do

A balanced article needs this section. DuckDB has real limits, and they matter:

  • No native distribution. Once your data exceeds the RAM on one machine, you're done. There's no "scale out" mode. You either move to Spark or pre-aggregate on a cluster and bring the result down.
  • No streaming. DuckDB is batch-only. Anything that needs continuous query semantics, watermarks, or stateful aggregations across infinite input — Structured Streaming or Flink, not DuckDB. The Real-Time Mode work in Spark 4.1 is in a category DuckDB has no answer to.
  • No fault tolerance across long shuffles. A single-node engine can't recover from a node failure mid-query, because there's no other node. Long-running cluster jobs that survive an executor death need Spark.
  • Smaller ecosystem. Spark's connector library — Kafka, Kinesis, JDBC sources, every cloud warehouse, every catalog — is enormous. DuckDB has the basics (Parquet, CSV, JSON, S3, Postgres, MySQL, Iceberg) and a growing extension ecosystem, but it's nowhere near Spark's surface area.
  • Memory model. DuckDB spills to disk when it has to, but it's still a single-process engine. Memory pressure shows up as slow disk I/O rather than as graceful repartitioning across executors. Tune your machine for the largest query you expect.

These limits are not bugs. They're the trade-offs that make DuckDB fast at what it's good at. If you need what DuckDB doesn't offer, you needed Spark anyway.

A Practical Way to Try It

The lowest-friction way to evaluate DuckDB in a Spark Scala team is to add it to a single existing repository alongside Spark and see where it earns its place. The JAR is small, the JDBC API is standard, and there's nothing to deploy.

A reasonable first experiment: pick one CI test that initializes a Spark session purely to assert on a small Parquet shape. Rewrite it against DuckDB JDBC and measure the time difference. If your CI runs that test on every PR, the savings add up across hundreds of runs. The same logic applies to local notebooks for exploring pipeline output, or any internal CLI tool that currently shells out to pyspark/spark-shell for a quick query.

Once a few of those small wins have stuck, it becomes natural to reach for DuckDB whenever the workload doesn't actually need a cluster — and to keep Spark for everything that does.

The Pitch in One Line

DuckDB is not coming for Spark's job, and Spark is not coming for DuckDB's. They occupy adjacent niches, and a team that knows when to use each will spend less on infrastructure and ship faster than a team that defaults to one tool for every problem. For Spark Scala developers, learning DuckDB is a small investment with a large surface area of quietly useful applications — none of which require giving up Spark for the work Spark is actually good at.

Article Details

Created: 2026-05-08

Last Updated: 2026-05-08 10:12:08 PM