The VARIANT Data Type in Spark 4.0: Semi-Structured Data Without Schema Headaches
Spark 4.0 added a native VARIANT type (SPARK-45827) for storing semi-structured data in a compact binary format you can query directly — no upfront schema, no from_json ceremony on every read. The published benchmarks show roughly 8x faster reads than storing the same payload as a JSON string column, and Spark 4.1 adds shredding to push that further. This article shows how the Scala API works, when to reach for VARIANT, and when you still want a strongly typed StructType.
Before Spark 4.0, every team handling JSON-shaped data ended up at the same fork in the road:
Store JSON as StringType and parse on read. Cheap to write, painful to query. Every from_json call needs a schema, and every analyst either invents one inline or pastes one from a doc that's two months out of date. The string column is opaque to the optimizer.
Pin a schema with StructType up front. Fast and type-safe, but every new field upstream is a schema migration. If the producer adds device.os.locale on Thursday, from_json silently drops it until someone updates the schema, and genuinely malformed records either become nulls or fail the job depending on your mode and columnNameOfCorruptRecord settings.
VARIANT is the third option. It stores semi-structured data in a binary encoding that preserves the full nested structure, lets you query paths directly with dot/bracket syntax, and does not require you to declare a schema. The encoding is an open specification shared with Parquet and the Delta Lake project, so the format is not Spark-proprietary.
The performance gain comes from the fact that VARIANT is parsed once at ingest, not on every query. Databricks' published numbers (Runtime 15.0 with Photon) show 8x improvement over equivalent String columns for both nested and flat schemas. Spark 4.0 ships the core type and functions; Spark 4.1 adds shredding — projecting frequently-accessed paths into columnar storage so common queries skip the binary decode entirely.
The Core Functions
Spark 4.0 ships a small, focused set of VARIANT functions. They are all available from SQL; in Scala you reach them through expr() or selectExpr(), since most do not have direct Column method equivalents.
| Function | Returns | Purpose |
| --- | --- | --- |
| parse_json(str) | VARIANT | Parse a JSON string into a VARIANT value. Throws on invalid JSON. |
| try_parse_json(str) | VARIANT | Same, but returns null on invalid input. |
| variant_get(v, path, type) | typed | Extract a path and cast it to a concrete Spark type. Throws on cast failure. |
| try_variant_get(v, path, type) | typed | Same, but returns null on extraction or cast failure. |
| schema_of_variant(v) | STRING | Returns the inferred schema of a single VARIANT value as a DDL string. |
| schema_of_variant_agg(v) | STRING | Aggregate version — infers a unified schema across all rows. |
| is_variant_null(v) | BOOLEAN | Distinguishes JSON null from SQL NULL. |
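A quick sketch of a few of these from the Scala side, assuming an existing SparkSession named spark (the JSON literals are invented for illustration):

```scala
// parse_json and schema_of_variant on literal JSON; is_variant_null tells an
// explicit JSON null apart from a SQL NULL.
spark.sql("""
  SELECT
    schema_of_variant(parse_json('{"a": 1, "b": [true, "x"]}')) AS inferred,
    is_variant_null(parse_json('null'))                         AS json_null,
    try_parse_json('not valid json')                            AS bad_input
""").show(truncate = false)
```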
There is also the : colon-path syntax (raw:store.bicycle.price), which is shorthand for variant_get — handy in SQL, less useful from the Scala DSL.
Reading JSON into VARIANT
The natural entry point is parse_json. Given a string column of JSON, one call gets you a queryable VARIANT.
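A minimal sketch, assuming a spark-shell or an existing SparkSession named spark; the column name and the three event payloads are invented for illustration:

```scala
import spark.implicits._

// Three payloads, three different shapes: a click, a purchase, a signup.
val raw = Seq(
  """{"event": "click", "target": {"id": "btn-buy", "page": "/pricing"}}""",
  """{"event": "purchase", "amount": 49.99, "items": [{"sku": "A1"}, {"sku": "B2"}]}""",
  """{"event": "signup", "device": {"os": {"name": "ios", "locale": "en_GB"}}}"""
).toDF("json")

// Parse once at ingest; payload is a first-class VARIANT column from here on.
val events = raw.selectExpr("parse_json(json) AS payload")

events.printSchema()
// root
//  |-- payload: variant (nullable = true)
```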
Note the schema: payload is now variant, a first-class type — not a struct, not a string. The three rows have three different shapes and that is fine. The VARIANT column does not need them to agree.
If your input is already non-string JSON (e.g., from a Kafka source where the value is parsed by the connector), you can build VARIANT values from arbitrary Spark expressions; parse_json is just the most common path.
Querying Nested Fields
This is where VARIANT pays off. You query paths directly, with optional type casts.
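Continuing with the hypothetical events DataFrame sketched above:

```scala
// Extract typed values by path; paths absent on a given row come back as null.
events.selectExpr(
  "variant_get(payload, '$.event', 'string')        AS event",
  "variant_get(payload, '$.target.id', 'string')    AS target_id",
  "variant_get(payload, '$.amount', 'double')       AS amount",
  "variant_get(payload, '$.items[0].sku', 'string') AS first_sku"
).show()
// The click row has a null amount; the purchase row has a null target_id.
```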
Missing paths return null. There is no exception when amount is absent on a click event or target.id is absent on a purchase. That's the point — VARIANT is for data where presence is not guaranteed.
The path uses JSONPath-style $.foo.bar syntax. Bracket indexing works for arrays: $.items[0].
The third argument is a Spark SQL type, not a Scala type. Pass it as a string ('string', 'double', 'array<string>', 'struct<id: string, page: string>').
If a value is present but cannot be cast to the requested type, variant_get throws under ANSI mode (which is now the default in Spark 4.0). When the source data is messy, swap to try_variant_get:
```scala
// Some rows have amount as a string like "49.99", some as a number, some absent.
events.selectExpr(
  "try_variant_get(payload, '$.amount', 'double') AS amount"
).show()
```
try_variant_get is to variant_get what try_cast is to cast. Reach for it at the expression level where null is the genuine semantic; keep variant_get everywhere else so the optimizer (and your future self) can see strict type expectations.
Extracting a Struct in One Shot
Plucking individual fields gets verbose. When you know the shape you want, cast the whole VARIANT (or a subpath) to a StructType:
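A sketch, again against the hypothetical events DataFrame, using the struct type string from above:

```scala
// One call pulls the whole $.target subtree out as a typed struct.
val typed = events.selectExpr(
  "variant_get(payload, '$.event', 'string') AS event",
  "variant_get(payload, '$.target', 'struct<id: string, page: string>') AS target"
)

typed.printSchema()
// root
//  |-- event: string (nullable = true)
//  |-- target: struct (nullable = true)
//  |    |-- id: string (nullable = true)
//  |    |-- page: string (nullable = true)
```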
This is the seam between schema-less and schema-bound. Store the data as VARIANT, then project the well-known parts of the payload into a typed struct at query time. New fields that the producer adds later are still in the underlying VARIANT — you just have not asked for them yet.
Inferring a Schema
When you don't know the shape, ask Spark. schema_of_variant_agg walks every row and returns a unified DDL string.
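For example, against the same hypothetical events DataFrame:

```scala
// A global aggregate: one row holding a DDL-style description of the unified shape.
events.selectExpr("schema_of_variant_agg(payload) AS unified_schema")
  .show(truncate = false)
// Prints something along the lines of:
// OBJECT<amount: DOUBLE, device: OBJECT<...>, event: STRING, items: ARRAY<...>, target: OBJECT<...>>
```

The result is a plain DDL string, so once a path stabilizes you can feed it straight back into the struct-extraction pattern from the previous section.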