Structured Streaming Real-Time Mode: A Practical Look
Spark 4.1 ships a new real-time execution mode for Structured Streaming that breaks the microbatch barrier, targeting single-digit millisecond p99 latency for stateless queries. Here's what it does, how it works, and when you should use it.
What Real-Time Mode Actually Is
If you've used Structured Streaming, you know the model: data arrives, Spark accumulates it into a microbatch, processes the batch, writes the output, and repeats. The latency floor is the batch interval — typically seconds to minutes. Even with Trigger.ProcessingTime("0 seconds"), you're still paying for batch boundaries, sequential stage execution, and disk-based shuffle between stages.
Real-time mode (SPARK-52330) replaces this with a fundamentally different execution strategy. Instead of chopping the stream into discrete batches, it processes records as they arrive with three architectural changes that eliminate the latency bottlenecks of microbatch processing.
This is not the old Trigger.Continuous that's been sitting in the API since Spark 2.3. That experiment had severe limitations and never left experimental status. Real-time mode is a new implementation built from the ground up.
The Three Key Changes
Continuous Data Flow
In microbatch mode, the engine reads data, processes it, writes output, then starts reading again. Each cycle introduces idle time between batches.
Real-time mode uses long-running batches (default: 5 minutes) where the engine continuously reads fresh data from the source as it becomes available. The batch duration controls how often the query checkpoints — not how often it processes data. Data flows through the pipeline continuously within each batch window.
Concurrent Stage Scheduling
Microbatch executes stages sequentially: all mappers finish before any reducer starts. If you have a query with multiple stages, the downstream stages sit idle while upstream stages complete.
Real-time mode runs all stages concurrently. Reducers start processing shuffle output as soon as upstream tasks produce it, rather than waiting for the entire map stage to complete. This alone eliminates a significant chunk of latency in multi-stage queries.
Streaming Shuffle
Traditional Spark shuffles write intermediate data to disk before downstream tasks read it. This disk round-trip adds latency even when the data is small.
Real-time mode replaces this with an in-memory streaming shuffle that passes data between stages immediately as it's produced. No disk writes, no waiting for the full shuffle to materialize.
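The concurrent-stage and streaming-shuffle ideas can be made concrete with a toy sketch in plain Scala (no Spark involved): a "map" thread hands each record to a "reduce" consumer through an in-memory queue the moment it is produced, so the downstream side starts working before the upstream side finishes. ShuffleRecord, the queue wiring, and the values are illustrative inventions, not Spark internals.

```scala
import java.util.concurrent.LinkedBlockingQueue

// Toy record type for the sketch (not a Spark class).
case class ShuffleRecord(key: String, value: Int)

def runStreamingShuffle(input: Seq[ShuffleRecord]): Map[String, Int] = {
  // In-memory channel between the "stages"; None marks end-of-stream.
  val queue = new LinkedBlockingQueue[Option[ShuffleRecord]]()

  // Upstream "map" task: emits each record as soon as it is ready,
  // instead of materializing the whole shuffle before handing it off.
  val mapper = new Thread(() => {
    input.foreach(r => queue.put(Some(r)))
    queue.put(None)
  })
  mapper.start()

  // Downstream "reduce" task: consumes immediately; it never waits for
  // the map side to complete, which is the point of concurrent stages.
  var totals = Map.empty[String, Int].withDefaultValue(0)
  var done = false
  while (!done) {
    queue.take() match {
      case Some(r) => totals = totals.updated(r.key, totals(r.key) + r.value)
      case None    => done = true
    }
  }
  mapper.join()
  totals
}
```

A real streaming shuffle also handles backpressure, multiple reducers, and failure recovery; the queue here only models the "hand off as produced, never touch disk" idea.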
How to Enable It
The opt-in is a single trigger change:
```scala
import org.apache.spark.sql.streaming.Trigger

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
  .option("topic", "processed-events")
  .option("checkpointLocation", "/checkpoints/events")
  .outputMode("update")
  .trigger(Trigger.RealTime())
  .start()
```
The Trigger.RealTime() method accepts an optional batch duration parameter that controls checkpoint frequency:
```scala
import org.apache.spark.sql.streaming.Trigger

// Checkpoint every 5 minutes (default)
.trigger(Trigger.RealTime())

// Checkpoint every 2 minutes — more frequent recovery points, slightly more overhead
.trigger(Trigger.RealTime("2 minutes"))

// Compare with microbatch — this is what you're replacing
.trigger(Trigger.ProcessingTime("1 second"))
```
The checkpoint interval is a trade-off: shorter intervals mean faster recovery after failures but add overhead. Longer intervals reduce overhead but mean more data to reprocess on restart.
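The trade-off is easy to quantify with a back-of-the-envelope sketch: under at-least-once semantics, the worst-case replay after a crash is roughly the ingest rate times the checkpoint interval. The helper and the rates below are hypothetical numbers, not measurements.

```scala
import scala.concurrent.duration._

// Worst-case records replayed after a failure: everything since the last
// checkpoint, i.e. ingest rate x checkpoint interval. Rates are illustrative.
def worstCaseReplayRecords(recordsPerSecond: Long, checkpointInterval: FiniteDuration): Long =
  recordsPerSecond * checkpointInterval.toSeconds
```

At 10,000 records/second, the default 5-minute interval risks replaying up to 3 million records, while a 2-minute interval caps that at 1.2 million, at the cost of more frequent checkpoint overhead.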
Current Scope in Spark 4.1
Real-time mode in the open-source Spark 4.1 release has a deliberately narrow scope. The team shipped stateless query support first, with stateful operations on the roadmap.
What's supported:
- Sources: Kafka
- Sinks: Kafka and Foreach (via ForeachWriter)
- Output mode: Update only
- Operations: Stateless — projections (select, map, flatMap, mapPartitions) and selections (where, filter)
- Delivery semantics: At-least-once
What's not supported yet:
- Stateful operations (aggregations, windowed joins, transformWithState)
- Stream-stream joins
- Session windows
- Delta Lake as source or sink
- foreachBatch
- Append or Complete output modes
This scope is narrower than what Databricks offers in their managed runtime, where real-time mode extends to aggregations, tumbling/sliding windows, dropDuplicates, broadcast joins, and transformWithState. Expect the open-source support to broaden in subsequent Spark releases.
At-Least-Once vs Exactly-Once
This is the trade-off you need to understand before switching. Microbatch Structured Streaming provides exactly-once delivery semantics — each record is processed and written to the sink exactly once, even across failures. Real-time mode provides at-least-once semantics — records are guaranteed to be processed, but may be processed more than once after a failure.
The reason: microbatch commits offsets atomically with the output, so there's no window where data can be double-processed. Real-time mode decouples processing from checkpointing (checkpoints happen on the batch duration interval), so a failure between checkpoints means replaying data from the last checkpoint.
For many workloads, this is fine:
- Fraud detection: A duplicate alert is far better than a missed one. Downstream systems typically deduplicate anyway.
- ML feature serving: Writing the same feature value twice is idempotent.
- Real-time personalization: Showing the same recommendation twice is harmless.
For workloads that require strict exactly-once (financial transactions, inventory updates), stay on microbatch or build idempotent sinks.
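A common way to build an idempotent sink is keyed upserts: replaying the same record leaves the store unchanged, which makes at-least-once delivery safe in practice. Here is a minimal sketch, with an in-memory map standing in for a real key-value store; the class and event IDs are illustrative, not a Spark API.

```scala
// Minimal idempotent sink: writes are keyed upserts, so a replay of the
// same (eventId, payload) pair after a failure is an effective no-op.
final class IdempotentStore {
  private var data = Map.empty[String, String]

  def upsert(eventId: String, payload: String): Unit =
    data = data.updated(eventId, payload)

  def get(eventId: String): Option[String] = data.get(eventId)
  def size: Int = data.size
}
```

In a real pipeline the event ID would come from the record itself (a Kafka key, a transaction ID), and the store would be something like a database with primary-key upserts.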
Compute Sizing
Real-time mode requires enough cluster resources to run all stages concurrently. In microbatch mode, stages share the same task slots sequentially. In real-time mode, they all run at the same time.
The math is straightforward: add up the task slots needed for each stage.
```scala
// Example: Kafka source (8 partitions) -> filter/transform -> Kafka sink
// Microbatch: 8 task slots is enough (stages share slots sequentially)
// Real-time: still 8 slots for a single-stage stateless query

// With a shuffle stage (e.g., repartition before sink):
//   Stage 1: 8 tasks (read from Kafka)
//   Stage 2: 20 tasks (shuffle partitions)
//   Real-time needs: 8 + 20 = 28 concurrent task slots

// Three stages:
//   Stage 1: 8 tasks, Stage 2: 20 tasks, Stage 3: 20 tasks
//   Real-time needs: 8 + 20 + 20 = 48 concurrent task slots
```
If your cluster doesn't have enough slots, tasks will queue and you'll lose the latency benefit. Size your cluster for the total concurrent task count, not just the largest single stage.
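That sizing rule reduces to two one-line helpers: microbatch needs only the widest single stage, while real-time mode needs the sum of every stage's task count. The stage sizes below mirror the hypothetical examples above.

```scala
// Microbatch: stages run sequentially, so slots are reused; size for the
// widest stage. Assumes at least one stage.
def microbatchSlots(stageTasks: Seq[Int]): Int = stageTasks.max

// Real-time: all stages run concurrently; size for the total.
def realTimeSlots(stageTasks: Seq[Int]): Int = stageTasks.sum
```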
How This Compares to Trigger.Continuous
If you've seen Trigger.Continuous in the Spark API, you might wonder how real-time mode differs. In short: real-time mode is the replacement.
Trigger.Continuous was introduced in Spark 2.3 as an experimental feature. It supported only projections and selections on Kafka sources, had no aggregation support, and never progressed beyond experimental status. It was architecturally limited — essentially a thin wrapper that skipped batch boundaries but didn't address the deeper issues of sequential stage execution and disk-based shuffle.
Real-time mode is a ground-up redesign that addresses the architectural bottlenecks (concurrent stages, streaming shuffle) and has a clear roadmap for stateful operation support. If you were waiting for Trigger.Continuous to mature, real-time mode is the answer.
Ideal Workloads
Real-time mode makes sense when:
- Latency matters more than exactly-once guarantees. Fraud detection, ML feature serving, real-time personalization, anomaly detection — workloads where acting on data in milliseconds is more valuable than guaranteeing no duplicates.
- Your query is stateless. In Spark 4.1's open-source release, this means filters, projections, and transformations without aggregation or state. Think Kafka-to-Kafka ETL: parse, enrich, filter, route.
- You're already on Structured Streaming. The API change is a single line — swap the trigger. Your existing DataFrame logic stays the same.
Real-time mode does not make sense for:
- Analytical ETL and medallion architectures. These don't need millisecond latency and benefit from microbatch's exactly-once guarantees and efficient bulk processing.
- Workloads with stateful operations. Until the open-source support expands, aggregations and joins still need microbatch.
- Cost-sensitive workloads. The always-on concurrent execution model uses more cluster resources than microbatch, which can share task slots across sequential stages.
What's Next
The Spark 4.1 release ships real-time mode with stateless support as the foundation. The SPIP design document outlines a phased rollout: stateless queries first, then stateful operations (aggregations, windowed joins, transformWithState) in subsequent releases.
If you're evaluating real-time mode, start with a stateless Kafka-to-Kafka pipeline that currently runs on microbatch. Swap the trigger, compare latencies, and assess whether the at-least-once trade-off works for your use case. That's the lowest-risk way to see the latency improvement firsthand.
For the full Spark 4.1 feature set beyond streaming, see the Spark 4.1 release highlights. If you're still on Spark 3.x, start with the migration guide — you'll need Spark 4.1 to use real-time mode.