Reading a JSON File Where the Schema Varies Between Records in Spark Scala

Event streams, webhook dumps, and log files rarely have a uniform shape — one record carries x and y coordinates, the next has a price, the next an email. Spark's JSON reader handles this gracefully, but the rules for how it handles it are worth knowing before you ship a pipeline. This tutorial walks through schema inference's union behavior, what happens when the same field has different types across records, supplying an explicit schema to stay defensive, capturing malformed rows with _corrupt_record, and the "keep it as raw JSON" escape hatch for genuinely heterogeneous nested payloads.

What Spark Does by Default

When you call spark.read.json(...) against a Dataset of strings (or a path), Spark scans the input twice: once to infer the schema by visiting every record, and once to read the rows. During inference it unions the fields seen across all records, so the final schema contains every field that appeared in any record. Fields a record didn't have come out as null.

Here are three events that share an event_id and action but otherwise have nothing in common:

val json = Seq(
  """{"event_id":"evt-001","action":"click","x":120,"y":340}""",
  """{"event_id":"evt-002","action":"purchase","item":"keyboard","price":89.0}""",
  """{"event_id":"evt-003","action":"signup","email":"alice@example.com"}""",
).toDS

val df = spark.read.json(json)

df.printSchema()
// root
//  |-- action: string (nullable = true)
//  |-- email: string (nullable = true)
//  |-- event_id: string (nullable = true)
//  |-- item: string (nullable = true)
//  |-- price: double (nullable = true)
//  |-- x: long (nullable = true)
//  |-- y: long (nullable = true)

df.show(false)
// +--------+-----------------+--------+--------+-----+----+----+
// |action  |email            |event_id|item    |price|x   |y   |
// +--------+-----------------+--------+--------+-----+----+----+
// |click   |null             |evt-001 |null    |null |120 |340 |
// |purchase|null             |evt-002 |keyboard|89.0 |null|null|
// |signup  |alice@example.com|evt-003 |null    |null |null|null|
// +--------+-----------------+--------+--------+-----+----+----+

Two things worth pointing out. First, every column is nullable = true — even the event_id that appears in every record. Schema inference can't prove a field is always present, so it conservatively marks everything nullable. Second, the columns come out alphabetically sorted rather than in source order. That's surprising the first time, but it's just how the inference path orders fields. If you need a specific column order in the output, follow the read with a select.

(For a refresher on creating these small DataFrames inline, see creating DataFrames in Spark Scala for testing with toDF.)

This default works fine for ad-hoc exploration. The cost is the extra full pass over the data for inference, plus the fact that your schema silently changes if the input adds or drops fields — a brittle property for anything that runs in production.

When the Same Field Has Different Types

The interesting case is when records disagree about a field's type, not just whether it's present. Consider a user_id that's a number in some records and a string in others:

val json = Seq(
  """{"event_id":"evt-001","user_id":42}""",
  """{"event_id":"evt-002","user_id":"u-17"}""",
  """{"event_id":"evt-003","user_id":91}""",
).toDS

val df = spark.read.json(json)

df.printSchema()
// root
//  |-- event_id: string (nullable = true)
//  |-- user_id: string (nullable = true)

df.show(false)
// +--------+-------+
// |event_id|user_id|
// +--------+-------+
// |evt-001 |42     |
// |evt-002 |u-17   |
// |evt-003 |91     |
// +--------+-------+

Spark widened the type to string to fit both shapes. The numeric 42 from the first record came out as "42", a string. This is Spark being helpful — it didn't drop the conflicting rows or fail the read — but the silent type change is the kind of thing that breaks downstream code that assumed user_id was a number. Knowing this is the behavior is half the battle.

The same widening rule applies to numeric types: a column that's long in some records and double in others comes out as double. Mixed scalar/array shapes don't widen — they end up as strings too.

Locking the Schema Down

For production pipelines, the right move is almost always to supply an explicit schema. You get three benefits at once: no extra inference pass, no surprise widening, and any field you don't declare gets dropped from the read entirely. Records that didn't include a declared field show up with null for that field, which is exactly the behavior you want.

val schema = StructType(Seq(
  StructField("event_id", StringType, nullable = false),
  StructField("action", StringType, nullable = true),
  StructField("item", StringType, nullable = true),
  StructField("price", DoubleType, nullable = true),
))

val json = Seq(
  """{"event_id":"evt-001","action":"click","x":120,"y":340}""",
  """{"event_id":"evt-002","action":"purchase","item":"keyboard","price":89.0}""",
  """{"event_id":"evt-003","action":"signup","email":"alice@example.com"}""",
).toDS

val df = spark.read.schema(schema).json(json)

df.printSchema()
// root
//  |-- event_id: string (nullable = true)
//  |-- action: string (nullable = true)
//  |-- item: string (nullable = true)
//  |-- price: double (nullable = true)

df.show(false)
// +--------+--------+--------+-----+
// |event_id|action  |item    |price|
// +--------+--------+--------+-----+
// |evt-001 |click   |null    |null |
// |evt-002 |purchase|keyboard|89.0 |
// |evt-003 |signup  |null    |null |
// +--------+--------+--------+-----+

Notice the x, y, and email fields from the source records aren't in the output — they weren't declared. The columns come out in the order you declared them, not alphabetical. And event_id is nullable = true in the printed schema even though we declared it nullable = false: Spark's JSON reader currently ignores the nullable flag on read and treats every field as nullable. That's not a bug you can fix from the call site, but it's worth knowing if you've been relying on the flag for downstream invariants — enforce non-nullability with a .filter(col("event_id").isNotNull) after the read instead.

This pattern is also dramatically faster than inference for large inputs: Spark skips the first pass and goes straight to reading.

Catching Malformed Records

In real data, some records will be unparseable — truncated lines, manual edits gone wrong, broken upstream serializers. By default Spark's JSON reader runs in PERMISSIVE mode, which turns each unparseable record into a row of all nulls and (if you ask for it) tucks the original raw text into a column named _corrupt_record.

To capture the malformed text, declare a _corrupt_record field of type string in your schema and set the columnNameOfCorruptRecord option to match:

val schema = StructType(Seq(
  StructField("event_id", StringType, nullable = true),
  StructField("action", StringType, nullable = true),
  StructField("_corrupt_record", StringType, nullable = true),
))

val json = Seq(
  """{"event_id":"evt-001","action":"click"}""",
  """{"event_id":"evt-002",,,}""",
  """{"event_id":"evt-003","action":"signup"}""",
).toDS

val df = spark.read
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .json(json)

df.printSchema()
// root
//  |-- event_id: string (nullable = true)
//  |-- action: string (nullable = true)
//  |-- _corrupt_record: string (nullable = true)

df.show(false)
// +--------+------+-------------------------+
// |event_id|action|_corrupt_record          |
// +--------+------+-------------------------+
// |evt-001 |click |null                     |
// |null    |null  |{"event_id":"evt-002",,,}|
// |evt-003 |signup|null                     |
// +--------+------+-------------------------+

The middle row's JSON was syntactically broken, so its parsed columns are null and the raw text is stashed in _corrupt_record. From here you can split the DataFrame into two — df.filter(col("_corrupt_record").isNotNull) for the bad rows to log or quarantine, df.filter(col("_corrupt_record").isNull).drop("_corrupt_record") for the good ones to push downstream.

Two alternative modes exist if quarantining isn't what you want:

DROPMALFORMED silently drops unparseable records. Fast and simple, but you lose visibility into what was wrong.
FAILFAST throws on the first bad record. Useful when you want a pipeline run to fail loudly rather than process partial data.

PERMISSIVE is the default and the right choice when you want to keep the run going while still capturing what went wrong.

Keeping a Varying Nested Payload as Raw JSON

Sometimes the schema variation is inside a field rather than across top-level keys — an event envelope is stable, but the payload it wraps looks different for every event type. Inferring a struct for payload ends up with the union-of-everything pattern from the first example, and the wide null-filled struct is awkward to work with.

A useful escape hatch is to read the data, then immediately re-serialize the variable part back to a JSON string. The envelope stays strongly typed, and the payload becomes a string column you can parse with from_json against a more specific schema once you know what event type each row is.

val json = Seq(
  """{"event_id":"evt-001","payload":{"action":"click","x":120,"y":340}}""",
  """{"event_id":"evt-002","payload":{"action":"purchase","item":"keyboard","price":89.0}}""",
  """{"event_id":"evt-003","payload":{"action":"signup","email":"alice@example.com"}}""",
).toDS

val df = spark.read.json(json)
  .select(col("event_id"), to_json(col("payload")).as("payload_json"))

df.printSchema()
// root
//  |-- event_id: string (nullable = true)
//  |-- payload_json: string (nullable = true)

df.show(false)
// +--------+----------------------------------------------------+
// |event_id|payload_json                                        |
// +--------+----------------------------------------------------+
// |evt-001 |{"action":"click","x":120,"y":340}                  |
// |evt-002 |{"action":"purchase","item":"keyboard","price":89.0}|
// |evt-003 |{"action":"signup","email":"alice@example.com"}     |
// +--------+----------------------------------------------------+

Now you can branch on payload_json by event type and apply a different from_json schema to each branch, instead of trying to fit every payload variant into one schema. (For the inverse — packing flat columns back into a JSON string — see converting a struct column to JSON.)

If you do want to flatten what Spark inferred rather than re-serialize it, the recursive flatten pattern in flattening deeply nested structs handles arbitrary depth.

Which Approach to Use

Situation	Approach
Ad-hoc exploration, you trust the input	Default `spark.read.json(path)`
Production pipeline, fields you care about are known	Explicit `schema(...)`
Need to quarantine or audit bad rows	Explicit schema + `_corrupt_record` + `PERMISSIVE`
Stable envelope wrapping a wildly varying payload	Read once, `to_json` the varying part, parse per-type later
Want the run to fail loudly on bad input	Explicit schema + `mode = FAILFAST`

The throughline is: schema inference is a great affordance for exploration, and a liability in production. Once you know what fields you actually need, write them down in a StructType and let Spark enforce it on every read. Records that don't match the schema then have exactly two fates — null-filled columns for missing fields, or the _corrupt_record bucket — and both are easy to reason about.