
Apache Gluten: Supercharging Spark with Native C++ Execution

Apache Gluten graduated to an ASF Top-Level Project in March 2026. It pushes Spark's physical operators down to native C++ engines (Velox or ClickHouse) via Substrait and JNI, keeping the JVM in charge of scheduling while the heavy lifting happens off-heap. Here's the architecture, how to wire it up on a Scala workload, and where the rough edges are.

What Gluten Actually Does

Spark's JVM is great at distributed scheduling and not particularly great at single-node compute. A Parquet scan, a hash aggregate, a hash join — those are the kinds of operations that look almost identical to what DuckDB, ClickHouse, or any modern columnar engine does in C++, and the C++ version is consistently faster per core.

Gluten is a plugin that sits inside a normal Spark driver and rewrites the physical plan before execution. Operators that have a native equivalent get converted to a Substrait plan and shipped over JNI to a native backend. Operators that don't get a native equivalent fall back to the regular JVM whole-stage codegen. Both halves run inside the same executor, exchanging data as Apache Arrow ColumnarBatch payloads.

From a Scala application's perspective nothing changes in your code. You still write spark.read.parquet(...), you still call .groupBy(...).agg(...), you still get a DataFrame back. The plan tree that runs on the executors is what's different — FileScanExec becomes NativeFileScanExec, HashAggregateExec becomes a Velox aggregate, and the Arrow batches flow through native operators instead of being decoded into InternalRow objects.
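
To make that concrete, here's a minimal sketch of an ordinary Scala batch job. The paths, column names, and object name are all illustrative; the point is that with Gluten enabled, the only visible difference is in the formatted plan output.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

object DailyRollup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("daily-rollup").getOrCreate()
    import spark.implicits._

    // Ordinary DataFrame code: nothing here is Gluten-specific.
    val events = spark.read.parquet("s3://bucket/events/")
    val rollup = events
      .filter($"event_date" === "2026-05-15")
      .groupBy($"country", $"product_id")
      .agg(sum($"revenue").as("revenue"))

    // With Gluten enabled, the formatted plan shows native operator names in
    // place of FileScanExec / HashAggregateExec (exact names vary by version).
    rollup.explain("formatted")

    rollup.write.mode("overwrite").parquet("s3://bucket/rollups/2026-05-15/")
    spark.stop()
  }
}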

This is the same idea behind Databricks' Photon engine, except Gluten is open source, runs on stock Apache Spark, and the native backend is pluggable.

The Architecture in One Picture

There are five moving pieces inside an executor running Gluten:

  • Plan converter — walks Spark's physical plan and turns each supported SparkPlan node into a Substrait relation. Substrait is a cross-engine intermediate representation, the same thing that lets DuckDB and DataFusion swap query plans.
  • JNI layer — pushes the Substrait plan, along with split metadata, into the native backend. There's a thin C++ shim that translates Substrait into the backend's own operator tree.
  • Native backend — either Velox (Meta's C++ execution library, the default) or ClickHouse (the Kyligence-maintained backend, mostly used on existing ClickHouse infrastructure). Velox is what most contributors target today.
  • Columnar shuffle — Spark's normal shuffle assumes row-oriented data. Gluten replaces the shuffle manager so intermediate batches stay in Arrow columnar form across the network rather than being row-converted on every exchange.
  • Fallback bridges — RowToColumnar and ColumnarToRow operators that convert at the boundary between native and JVM stages. Every fallback costs you a copy, so keeping the fallback rate low is the main tuning lever.

The control flow — DAG scheduler, task scheduler, broadcast, accumulators, Spark UI, event logs, dynamic allocation — all of that stays in the JVM. Gluten is not a fork. It's a plugin that registers a custom physical-plan rule and a custom ShuffleManager.
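
For reference, this is the shape of the standard Spark plugin extension point that spark.plugins targets. It's a generic skeleton of Spark's public API, not Gluten's actual source; it just shows where a plugin like Gluten gets its hooks on the driver and on each executor.

import java.util.{Collections, Map => JMap}
import org.apache.spark.SparkContext
import org.apache.spark.api.plugin.{DriverPlugin, ExecutorPlugin, PluginContext, SparkPlugin}

// Skeleton of Spark's plugin API (the spark.plugins mechanism). Gluten's real
// plugin uses these hooks to install its physical-plan rewrite rule and to
// load the native backend library on each executor.
class ExamplePlugin extends SparkPlugin {
  override def driverPlugin(): DriverPlugin = new DriverPlugin {
    override def init(sc: SparkContext, ctx: PluginContext): JMap[String, String] = {
      // Driver-side setup runs once, when the SparkContext starts.
      Collections.emptyMap()
    }
  }
  override def executorPlugin(): ExecutorPlugin = new ExecutorPlugin {
    // Executor-side hook: where per-executor native initialization would go.
  }
}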

The Numbers

The numbers Gluten advertises are honest TPC-H/TPC-DS runs against vanilla Spark on the same hardware:

  • Velox backend, single node, 3TB TPC-H-like dataset: 3.34x overall, peak 23.45x on individual queries. TPC-DS-like is 3.02x overall.
  • ClickHouse backend, 8 nodes, 1TB: 2.12x overall, peak 3.48x.

For mixed real-world workloads the rule of thumb that comes up in user reports is 1-3x faster wall-clock, 20-30% less CPU per query, 15-20% lower memory pressure. Don't expect Photon-level numbers on every query — anything heavy in JVM UDFs, Scala lambdas, or unsupported operators will fall back and the speedup disappears.

You're not paying for this with extra hardware. The Arrow data format and off-heap memory mean Gluten executors typically use less JVM heap than vanilla Spark, with the trade-off being that more memory ends up in the C++ allocator's off-heap pool.

Enabling Gluten on a Scala Job

There's no SDK change. You don't add a library to build.sbt, you don't import anything, you don't touch your application code. Gluten ships as a fat JAR you drop on the executor classpath and a handful of spark.* config keys.

The minimum config for a Velox-backed run with Spark 4.0:

spark-submit \
  --jars /opt/gluten/gluten-velox-bundle-spark4.0_2.13-1.6.0.jar \
  --conf spark.plugins=org.apache.gluten.GlutenPlugin \
  --conf spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=20g \
  --conf spark.gluten.sql.columnar.backend.lib=velox \
  --class com.example.MyEtlJob \
  my-spark-job-assembly-1.0.jar

A few notes on what each line is doing:

  • spark.plugins is the standard Spark plugin extension point. The Gluten plugin installs the physical-plan rewriter when the SparkContext starts.
  • ColumnarShuffleManager is non-negotiable. Without it, every shuffle would force a columnar-to-row conversion and the speedup evaporates.
  • Off-heap memory is where the Velox/ClickHouse engine allocates working memory. Allocating zero off-heap is a configuration mistake people make on day one — the native side fails to acquire memory and falls back. The rule of thumb is to give roughly 50-70% of the executor's container memory to off-heap when running Gluten. A worked sizing sketch follows this list.
  • The backend.lib switch picks Velox vs ClickHouse. Velox is the default and what new users should start with.
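
As an illustration of that split, here's a hypothetical sizing for a 32g executor container in spark-defaults.conf form, giving roughly 60% to off-heap. The ratios are illustrative, not a recommendation for your workload:

# Hypothetical sizing for a 32g executor container (8 + 20 + 4):
spark.executor.memory           8g     # JVM heap: scheduling, shuffle bookkeeping, fallback stages
spark.memory.offHeap.enabled    true
spark.memory.offHeap.size       20g    # native working memory: hash tables, sorts, Arrow batches
spark.executor.memoryOverhead   4g     # headroom for the native allocator and OS buffers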

If you're running on Kubernetes via the official operator, the same flags go into the SparkApplication CRD's sparkConf map. If you're on YARN, they go on the command line or in spark-defaults.conf for the cluster.
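
For the Kubernetes case, a hypothetical SparkApplication excerpt might look like the following. Field names follow the spark-operator's v1beta2 CRD; treat this as a sketch and check it against your operator version:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: my-etl-job
spec:
  sparkConf:
    "spark.plugins": "org.apache.gluten.GlutenPlugin"
    "spark.shuffle.manager": "org.apache.spark.shuffle.sort.ColumnarShuffleManager"
    "spark.memory.offHeap.enabled": "true"
    "spark.memory.offHeap.size": "20g"
    "spark.gluten.sql.columnar.backend.lib": "velox"
  deps:
    jars:
      - "local:///opt/gluten/gluten-velox-bundle-spark4.0_2.13-1.6.0.jar"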

What Falls Back

The first time you turn Gluten on, your Spark UI will become essential. Stages that run natively show up with Native in the operator name; stages that fall back show vanilla Spark operator names. Watching the explain plan tells you whether you actually got the speedup you wanted.
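
One cheap way to quantify the fallback rate from code is to walk the executed plan and count operator names. The substring patterns below are a heuristic and vary by Gluten version and backend; inspect your own explain() output to see what your setup actually emits:

import org.apache.spark.sql.DataFrame

// Heuristic fallback check: count physical-plan nodes whose names suggest
// native execution, and print the ones that don't match.
def fallbackReport(df: DataFrame): Unit = {
  def looksNative(n: String): Boolean =
    n.contains("Transformer") || n.contains("Native") || n.contains("Velox")
  val nodes = df.queryExecution.executedPlan.collect { case p => p.nodeName }
  val native = nodes.count(looksNative)
  println(s"native operators: $native of ${nodes.size}")
  nodes.filterNot(looksNative).foreach(n => println(s"  fallback: $n"))
}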

The things that reliably fall back today:

  • Any JVM-defined UDF — udf((s: String) => ...). The lambda lives in the JVM and there's no way to push it down. If you have a UDF in a hot path, that stage runs as vanilla Spark and incurs a columnar-to-row conversion at the boundary.
  • RDD operations — Gluten only rewrites DataFrame/Dataset/SQL plans. The moment you call .rdd you're back in JVM land.
  • Some complex-type writes — Parquet writes of nested struct/array/map are still partially JVM. This is on the 2026 roadmap to be made fully native.
  • Operators with low native coverage — window functions with rare frame specifications, certain ANSI-mode behaviors, and a handful of date/time functions. Each release expands the supported surface; check the operator coverage matrix on the Gluten site if you have a specific query in mind.

The practical implication for Scala teams: if you're heavy on typed Dataset transformations that get expressed as Spark functions, you'll see most of the benefit. If you're heavy on udf() or .map(...) over a Dataset, you'll see less.
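
The difference is easy to see side by side. In this sketch (the column and data are made up), the UDF version forces the stage back to the JVM, while the built-in upper() has a native translation:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{udf, upper}

val spark = SparkSession.builder().appName("udf-vs-builtin").getOrCreate()
import spark.implicits._

val df = Seq("alice", "bob").toDF("name")

// Falls back: the lambda is JVM bytecode with no native equivalent, so this
// stage runs as vanilla Spark and pays a columnar-to-row conversion.
val toUpperUdf = udf((s: String) => s.toUpperCase)
val viaUdf = df.withColumn("upper_name", toUpperUdf($"name"))

// Stays native: upper() is a built-in Catalyst expression the plan converter
// can translate, so the whole stage can run in the C++ backend.
val viaBuiltin = df.withColumn("upper_name", upper($"name"))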

The site's piece on VARIANT in Spark 4 covers one specific operator class — Variant shredding — that the Gluten team is actively working on for the 1.7 release.

When Gluten Pays Off

The workloads where Gluten consistently wins:

  • Parquet-heavy scan and filter workloads. This is the sweet spot. Native Parquet readers in Velox are dramatically faster than the JVM reader, and predicate pushdown is handled in C++ with vectorized decoding.
  • Hash-aggregate-heavy SQL. ETL jobs that group, aggregate, and write — the kind of shape that dominates production batch pipelines — translate cleanly to native operators.
  • Hash joins on columnar-friendly data. Especially when the build side fits in memory. The native hash table implementation is meaningfully faster than Spark's.
  • TPC-H/TPC-DS-shaped analytical workloads. This is what the benchmarks measure, and the speedup transfers to real workloads of similar shape.

The workloads where Gluten doesn't help much, or actively hurts:

  • UDF-dominated pipelines. If most of your wall-clock is spent in udf((row: Row) => ...), Gluten can't accelerate it. The fallback overhead may even make it slower.
  • Many small jobs. Gluten has a one-time native engine warm-up per executor. For sub-30-second jobs this is a noticeable fraction of total runtime.
  • Streaming workloads — Structured Streaming is currently outside Gluten's coverage, including the new real-time mode.
  • Tiny data. Under a few GB per executor, the JVM is fine and you don't need any of this.

The 2026 Roadmap

The official Gluten 2026 roadmap is worth skimming if you're evaluating adoption. The highlights for Scala teams:

  • Spark 4.0 and 4.1 GA support. Gluten 1.6.0 already runs on Spark 4.0/4.1 — the remaining work is closing out previously-disabled test suites and aligning behavior with ANSI mode (which is now the Spark 4 default).
  • VARIANT shredding support for the Parquet reader and writer. This pairs with what Delta Lake and Iceberg are doing for shredded Variants.
  • Spark 3.2 and 3.3 deprecation, JDK 8 deprecation. If you're on either of those, start budgeting for the upgrade. Gluten's modern support window is converging on Spark 3.5+ and JDK 11+ (Java 17 is what Spark 4 itself requires).
  • TIMESTAMP_NTZ type support at the native layer — a long-standing gap.
  • Bolt backend integration, ByteDance's LLVM JIT engine, as a third backend option alongside Velox and ClickHouse.
  • A quarterly release cadence (1.7, 1.8, 1.9, 2.0) going forward.

Should You Try It?

If you run analytical batch workloads on Spark 3.4+ with Parquet and you've already tuned the obvious things — partitioning, AQE, broadcast hints, shuffle partition counts — Gluten is the next lever to pull. The setup cost is small (drop in a JAR, set five config keys), the rollback is trivial (remove the JAR), and the downside on a workload that doesn't fit is usually no-op rather than regression.

A reasonable evaluation plan:

  • [ ] Pick a representative job. Something that runs daily, takes 10+ minutes, and is Parquet-and-aggregate heavy. Not a UDF-bound job.
  • [ ] Run it twice. Once on stock Spark, once with Gluten enabled, same cluster size, same data. Watch the Spark UI to confirm operators are running natively.
  • [ ] Check the fallback rate. If most stages show Native operators, the comparison is meaningful. If half your stages fall back, fix that first or pick a different job.
  • [ ] Measure wall-clock and cost. A 2x wall-clock speedup typically translates to cost savings if you're auto-scaling. On fixed clusters, it translates to either faster pipelines or smaller clusters. A minimal timing sketch follows this list.
  • [ ] Stress the off-heap configuration. Tune spark.memory.offHeap.size and watch for native OOM errors. This is the most common production gotcha.
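
For the wall-clock measurement, a tiny harness is enough. This sketch just times a terminal action; runJob is a placeholder for whatever your pipeline's action is (a write, a count):

// Minimal A/B timing sketch: run the same action on stock Spark and on
// Gluten, same cluster and data, and compare the printed durations.
def timed[T](label: String)(runJob: => T): T = {
  val start = System.nanoTime()
  val result = runJob
  println(f"$label took ${(System.nanoTime() - start) / 1e9}%.1f s")
  result
}

// Example, reusing names from the earlier sketch:
// timed("daily-rollup") { rollup.write.mode("overwrite").parquet(outPath) }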

For teams running Spark on Databricks, Gluten is interesting mostly as a comparison point to Photon — and a possible exit strategy if you want similar native-execution speedups without the Databricks dependency. The open source vs Databricks decision shifts when one of the things you'd be giving up has a credible open-source equivalent.

For everyone else running open-source Spark, Gluten is the most consequential performance story of 2026. ASF graduation alongside Polaris signals the project has the governance and contributor base to stick around — Intel, Kyligence, Alibaba Cloud, Meituan, BIGO, Microsoft, IBM, and Google are all actively contributing. The bet that native execution is the future of Spark looks correct, and this is the open-source way to take that bet today.

For the full project documentation, see the Apache Gluten site and the GitHub repository.
