
Spark Declarative Pipelines: First Look from a Scala Dev

Spark 4.1 introduces Spark Declarative Pipelines (SDP) — a framework for building managed ETL pipelines where you declare datasets and Spark handles the rest. The catch for Scala developers: authoring is Python and SQL only, with no JVM support yet.

What Declarative Pipelines Actually Are

If you've built Spark ETL pipelines, you know the pattern: write a bunch of DataFrame transformations, wire them together with orchestration logic (Airflow, custom scripts, cron), handle retries manually, manage checkpoints yourself, and hope the execution order is right. SDP replaces that manual plumbing with a declare-and-run model.

You define what your datasets should contain. SDP figures out how to execute them — dependency ordering, parallelism, checkpointing, retries, and incremental processing.

The three building blocks:

  • Streaming Tables — incrementally process new data as it arrives, supporting both streaming and batch sources
  • Materialized Views — precomputed query results stored as tables, recomputed from scratch on each pipeline run
  • Temporary Views — intermediate transformations scoped to the pipeline execution, not persisted

SDP analyzes the dependency graph between these datasets, validates it (catching cycles and missing references before execution), and orchestrates everything automatically.
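That validation pass is, at heart, plain graph analysis. As a rough illustration of the idea (a hand-rolled sketch, not SDP's actual implementation), catching undefined references and cycles in a dataset dependency graph can look like this:

```python
# Illustrative only: a minimal validation pass over a dataset
# dependency graph, mimicking the checks SDP runs before execution.
# The dataset names below come from the article's example pipeline.

def validate(graph):
    """graph maps each dataset name to the list of datasets it reads from."""
    # 1. Missing references: every dependency must itself be declared.
    missing = {dep for deps in graph.values() for dep in deps} - graph.keys()
    if missing:
        raise ValueError(f"undefined datasets referenced: {sorted(missing)}")

    # 2. Cycles: depth-first search with a "currently visiting" set.
    done, visiting = set(), set()

    def visit(node):
        if node in done:
            return
        if node in visiting:
            raise ValueError(f"cycle detected at {node!r}")
        visiting.add(node)
        for dep in graph[node]:
            visit(dep)
        visiting.discard(node)
        done.add(node)

    for node in graph:
        visit(node)

pipeline = {
    "raw_orders": [],
    "customers": [],
    "enriched_orders": ["raw_orders", "customers"],
    "daily_orders_by_region": ["enriched_orders"],
}
validate(pipeline)  # passes: acyclic, all references declared
```

The payoff is that both failure modes surface as pipeline errors before any cluster resources are spent, rather than as a mysterious mid-run failure.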

The Programming Model

Pipeline datasets are defined in Python or SQL source files within a project directory. A YAML spec file ties it all together:

# spark-pipeline.yml
name: order_enrichment
libraries:
  - glob:
      include: transformations/**
catalog: production
database: orders
storage: /checkpoints/order_enrichment
configuration:
  spark.sql.shuffle.partitions: "200"
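The libraries glob determines which source files belong to the pipeline. To make the include pattern concrete, here's an illustrative stdlib sketch of that kind of file discovery (the helper name and directory layout are invented; SDP's real discovery logic may differ in details):

```python
# Illustrative sketch of what an include pattern like
# `transformations/**` gathers from a project directory.
from pathlib import Path

def collect_sources(project_root):
    # SDP pipeline definitions live in Python and SQL source files,
    # so only those extensions are kept.
    return sorted(
        p for p in Path(project_root).glob("transformations/**/*")
        if p.suffix in {".py", ".sql"}
    )
```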

In SQL, you declare datasets with familiar DDL syntax. The STREAM keyword signals incremental processing:

-- Streaming Table: incrementally reads new orders from Kafka
CREATE STREAMING TABLE raw_orders
AS SELECT * FROM STREAM kafka_orders_source;

-- Materialized View: joins orders with customer data
CREATE MATERIALIZED VIEW enriched_orders
AS SELECT
  o.order_id,
  o.order_date,
  o.amount,
  c.name AS customer_name,
  c.region
FROM raw_orders o
INNER JOIN customers c ON o.customer_id = c.customer_id;

-- Materialized View: daily aggregation
CREATE MATERIALIZED VIEW daily_orders_by_region
AS SELECT region, DATE(order_date) AS day, COUNT(*) AS order_count
FROM enriched_orders
GROUP BY region, DATE(order_date);

In Python, decorators define the dataset type and the function body returns a DataFrame:

# Python — not Scala, but worth understanding the model
from pyspark import pipelines as dp

@dp.table
def raw_orders():
    # In a real pipeline you'd parse the Kafka value payload into
    # typed columns here; the raw load is kept short for illustration.
    return (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "broker1:9092")
        .option("subscribe", "orders")
        .load()
    )

@dp.materialized_view
def enriched_orders():
    return (
        spark.table("raw_orders")
        .join(spark.table("customers"), "customer_id")
        .select("order_id", "order_date", "amount", "customer_name", "region")
    )

SDP reads these definitions, builds the dependency graph (it knows enriched_orders depends on raw_orders and customers), validates it, then executes in the right order with maximum parallelism.
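"The right order with maximum parallelism" falls directly out of the graph: any dataset whose upstream dependencies are all satisfied can run concurrently with its peers. A minimal sketch of that scheduling idea (not SDP's actual executor) groups datasets into waves:

```python
# Illustrative sketch: group datasets into waves where every member's
# upstream dependencies are already satisfied, so each wave can run
# in parallel. Dataset names match the article's example pipeline.

def execution_waves(graph):
    """graph maps each dataset name to the list of datasets it reads from."""
    remaining = dict(graph)
    done, waves = set(), []
    while remaining:
        ready = sorted(n for n, deps in remaining.items()
                       if all(d in done for d in deps))
        if not ready:
            raise ValueError("cycle: no dataset is runnable")
        waves.append(ready)
        done.update(ready)
        for n in ready:
            del remaining[n]
    return waves

graph = {
    "raw_orders": [],
    "customers": [],
    "enriched_orders": ["raw_orders", "customers"],
    "daily_orders_by_region": ["enriched_orders"],
}
# raw_orders and customers have no dependencies, so they form the
# first wave and could execute concurrently.
print(execution_waves(graph))
# [['customers', 'raw_orders'], ['enriched_orders'], ['daily_orders_by_region']]
```

With hand-wired orchestration, this wave structure is exactly what you end up encoding by hand in DAG definitions; SDP derives it from the dataset declarations.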

Running a Pipeline

The CLI is straightforward:

# Validate without executing — catches syntax errors, missing tables, cyclic dependencies
spark-pipelines dry-run --spec spark-pipeline.yml

# Execute the pipeline
spark-pipelines run --spec spark-pipeline.yml

# Scaffold a new project
spark-pipelines init --name my_pipeline

Under the hood, spark-pipelines run translates to a spark-submit invocation, so it works with any cluster manager Spark supports. The CLI sends commands to a Spark Connect server over gRPC; it's a full Spark Connect client, not a thin wrapper.

The Scala Angle: Honest Assessment

Here's the part that matters most to readers of this site: SDP in Spark 4.1 is Python and SQL only. You cannot define pipeline datasets in Scala or Java. The framework runs exclusively through Spark Connect, meaning pipeline definitions are submitted from a client process, and there's no JVM-based pipeline authoring path.

This is not a temporary oversight. SDP was built on top of Spark Connect's client-server architecture from the start. Scala support is on the roadmap (the Spark Connect Scala client exists and is expanding), but it's not in 4.1 and there's no confirmed release target.

What Scala developers can do today:

1. Consume SDP output tables from Scala applications. SDP-managed Streaming Tables and Materialized Views are regular catalog tables. Your existing Scala jobs can read them like any other table:

// Your Scala job reads from SDP-managed tables — no special API needed
import org.apache.spark.sql.functions.{col, count, sum}

val enrichedOrders = spark.table("production.orders.enriched_orders")

val highValueByRegion = enrichedOrders
  .filter(col("amount") > 1000)
  .groupBy("region")
  .agg(
    count("*").as("order_count"),
    sum("amount").as("total_amount")
  )

2. Use SQL for pipeline definitions. If your team's Scala expertise is primarily in the transformation logic, the SQL authoring path may be sufficient. SDP's SQL dialect is standard Spark SQL — no new syntax beyond CREATE STREAMING TABLE, CREATE MATERIALIZED VIEW, and the STREAM keyword. The dependency resolution, checkpointing, and retry logic are all handled by the framework regardless of authoring language.

3. Keep hand-rolled Scala orchestration for now. If your pipelines require Scala-specific logic (complex type-safe transformations, custom encoders, library code), SDP doesn't replace them yet. There's no urgency to migrate — your existing Spark Scala pipelines aren't going anywhere.

What SDP Replaces

The real value proposition isn't the authoring language — it's what you stop maintaining:

  • Manual dependency ordering — SDP resolves the graph. No more ensuring Job A finishes before Job B starts.
  • Checkpoint management — SDP handles checkpoint directories, recovery, and state cleanup for streaming tables.
  • Retry logic — failed datasets are retried automatically. No more custom retry wrappers in Airflow.
  • Incremental processing — Streaming Tables only process new data. No more manually tracking watermarks or offsets.
  • Parallelism — independent datasets execute concurrently without explicit configuration.
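For a concrete sense of the plumbing the incremental-processing bullet removes, here's the kind of hand-rolled offset bookkeeping an incremental job otherwise carries around. This is a deliberately minimal sketch with an invented state file and a plain list standing in for a Kafka partition; real implementations also have to worry about atomic commits and failures mid-batch:

```python
# Illustrative only: manual offset tracking of the sort SDP's
# streaming tables make unnecessary. `source` is a plain list
# standing in for a partitioned stream; `offset_file` is a
# hypothetical state file the job must maintain itself.
import json
from pathlib import Path

def process_new_records(source, offset_file):
    path = Path(offset_file)
    # Load the last committed offset, defaulting to the beginning.
    offset = json.loads(path.read_text())["offset"] if path.exists() else 0
    new = source[offset:]  # only the records not yet processed
    # ... transform and write `new` downstream here ...
    # Commit the new offset so the next run resumes correctly.
    path.write_text(json.dumps({"offset": offset + len(new)}))
    return new
```

Multiply that by every source in every pipeline, add crash recovery, and it becomes a real maintenance surface; with a Streaming Table, the framework owns that state.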

If you're currently managing this plumbing with Airflow DAGs, custom scripts, or hand-rolled orchestration, that's the code SDP aims to eliminate. Whether Python authoring is an acceptable trade-off for removing that operational burden depends on your team.

Relationship to Databricks DLT

If SDP looks familiar, it should. Spark Declarative Pipelines is the open-source evolution of Databricks Delta Live Tables (DLT), now rebranded as Lakeflow Declarative Pipelines on Databricks. The core model — declare datasets, let the framework orchestrate — is the same.

The differences: Databricks' managed version includes tighter Unity Catalog integration, expectations (data quality rules), and operational features like automatic cluster management. The open-source SDP in Spark 4.1 is the foundation — dataset declaration, dependency resolution, streaming and batch processing, and CLI execution.

Notably, DLT had briefly dropped Scala support before this open-source contribution. The fact that Scala support is on the SDP roadmap suggests Databricks and the Spark community recognize the demand.

What's on the Roadmap

The SDP roadmap is being developed in the open with the Spark community. Upcoming priorities include:

  • Scala authoring support via Spark Connect client expansion
  • Continuous execution mode for always-on pipeline processing
  • More efficient incremental processing beyond the current streaming model
  • Change Data Capture (CDC) as a built-in capability
  • Data quality expectations (already in Databricks' managed version, expected to land in open source)

Should You Adopt It?

If your pipelines are already working in Scala: No urgency. SDP doesn't obsolete Scala Spark applications — it's a different execution model for a specific class of ETL workloads. Wait for Scala authoring support if the language matters to your team.

If you're starting a new ETL pipeline and your team knows SQL: SDP's SQL authoring is genuinely simpler than hand-rolling orchestration. The dependency resolution and checkpoint management alone save real operational effort. Write the pipeline definitions in SQL, consume the output tables from Scala where needed.

If you're evaluating SDP vs. Airflow + Spark: The comparison isn't direct — SDP handles the "what" (dataset definitions and dependencies) while Airflow handles the "when" (scheduling, triggers, cross-system orchestration). For pure Spark-to-Spark ETL pipelines, SDP is more purpose-built. For pipelines that span multiple systems, you may still need an external orchestrator.

The bottom line for Scala developers: SDP is worth understanding even if you can't author in Scala yet. The output tables are fully accessible from Scala code, the SQL authoring path is a reasonable bridge, and Scala support is coming. This is the direction Spark ETL is heading — declarative over imperative.

For the broader Spark 4.1 feature set including real-time streaming and recursive CTEs, see the Spark 4.1 release highlights. For more on Spark Connect's architecture (which SDP builds on), see our Spark 4.0 overview. If you're building streaming pipelines and want low latency, check out the real-time mode deep dive.

Article Details

Created: 2026-03-30

Last Updated: 2026-03-30 11:17:37 PM