
Flattening Deeply Nested Structs into a Flat DataFrame in Spark Scala

JSON events, Protobuf payloads, and Avro records often arrive in Spark as DataFrames with structs nested several levels deep. Most analytical work is easier on a flat schema. This tutorial walks through three approaches — explicit dotted paths, the .* star expansion, and a recursive flatten function — and finishes with the name-collision trap that catches most people the first time.

A Deeply Nested Starting Point

To have something to flatten, here's a DataFrame built from JSON event records. Each event has a top-level user struct, which itself contains a profile struct — so user.profile.name lives two levels deep.

// spark-shell provides these imports automatically; a compiled
// application needs them explicitly (spark is the SparkSession)
import org.apache.spark.sql.functions._
import spark.implicits._

val json = Seq(
  """{"event_id":"evt-001","ts":"2026-04-01T10:15:00Z","user":{"id":42,"profile":{"name":"Alice","city":"Berlin"}},"device":{"os":"iOS","version":"17.2"}}""",
  """{"event_id":"evt-002","ts":"2026-04-01T10:16:23Z","user":{"id":17,"profile":{"name":"Bob","city":"Lisbon"}},"device":{"os":"Android","version":"14"}}""",
  """{"event_id":"evt-003","ts":"2026-04-01T10:18:09Z","user":{"id":91,"profile":{"name":"Charlie","city":"Tokyo"}},"device":{"os":"iOS","version":"17.1"}}""",
).toDS

val df = spark.read.json(json)

df.printSchema()
// root
//  |-- device: struct (nullable = true)
//  |    |-- os: string (nullable = true)
//  |    |-- version: string (nullable = true)
//  |-- event_id: string (nullable = true)
//  |-- ts: string (nullable = true)
//  |-- user: struct (nullable = true)
//  |    |-- id: long (nullable = true)
//  |    |-- profile: struct (nullable = true)
//  |    |    |-- city: string (nullable = true)
//  |    |    |-- name: string (nullable = true)

df.show(false)
// +-------------+--------+--------------------+----------------------+
// |device       |event_id|ts                  |user                  |
// +-------------+--------+--------------------+----------------------+
// |{iOS, 17.2}  |evt-001 |2026-04-01T10:15:00Z|{42, {Berlin, Alice}} |
// |{Android, 14}|evt-002 |2026-04-01T10:16:23Z|{17, {Lisbon, Bob}}   |
// |{iOS, 17.1}  |evt-003 |2026-04-01T10:18:09Z|{91, {Tokyo, Charlie}}|
// +-------------+--------+--------------------+----------------------+

The nested fields are queryable as-is — you can write df.filter(col("user.profile.city") === "Berlin") and it works, and grouping and joining on dotted paths work too. But the result columns of those operations inherit awkward names, and once you hand the DataFrame to a downstream tool that expects flat tabular data, structs won't be accepted at all — at that point it's time to flatten.
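For reference, here are a couple of queries that run directly against the nested schema — a sketch against the df built above; nothing here requires flattening:

```scala
// dotted paths are accepted anywhere a column reference is
val berliners = df.filter(col("user.profile.city") === "Berlin")

// nested leaves can be selected and aliased on the fly
berliners.select(col("event_id"), col("user.profile.name").as("name")).show(false)
```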

Selecting Individual Nested Fields

The most straightforward approach is to pick out exactly the leaf fields you want, naming each one yourself. Dotted paths work in col() and any place a column reference is accepted.

val flat = df.select(
  col("event_id"),
  col("ts"),
  col("user.id").as("user_id"),
  col("device.os").as("device_os"),
  col("device.version").as("device_version"),
)

flat.printSchema()
// root
//  |-- event_id: string (nullable = true)
//  |-- ts: string (nullable = true)
//  |-- user_id: long (nullable = true)
//  |-- device_os: string (nullable = true)
//  |-- device_version: string (nullable = true)

flat.show(false)
// +--------+--------------------+-------+---------+--------------+
// |event_id|ts                  |user_id|device_os|device_version|
// +--------+--------------------+-------+---------+--------------+
// |evt-001 |2026-04-01T10:15:00Z|42     |iOS      |17.2          |
// |evt-002 |2026-04-01T10:16:23Z|17     |Android  |14            |
// |evt-003 |2026-04-01T10:18:09Z|91     |iOS      |17.1          |
// +--------+--------------------+-------+---------+--------------+

This is the right tool when you only need a handful of the nested fields — you stay in control of which columns end up in the output and exactly what they're called. The cost is verbosity: every leaf you want has to be named twice, once as a path and once as an alias.
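When the field list grows, one way to cut the path-and-alias repetition is to keep the pairs in a single sequence and map them into columns — a small refactor of the select above, sketched here:

```scala
// path -> output-name pairs, kept in one place
val fields = Seq(
  "event_id"       -> "event_id",
  "ts"             -> "ts",
  "user.id"        -> "user_id",
  "device.os"      -> "device_os",
  "device.version" -> "device_version"
)

// map each pair into an aliased column and splat into select
val flatFromPairs = df.select(fields.map { case (path, alias) => col(path).as(alias) }: _*)
```

A nice side effect is that the pairs can live in configuration rather than code.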

One-Level Flatten with .*

When you want every field of a struct, Spark supports a star expansion: col("user.*") expands to every field inside the user struct. It saves you from listing them by hand.

val flat = df.select(col("event_id"), col("ts"), col("user.*"), col("device.*"))

flat.printSchema()
// root
//  |-- event_id: string (nullable = true)
//  |-- ts: string (nullable = true)
//  |-- id: long (nullable = true)
//  |-- profile: struct (nullable = true)
//  |    |-- city: string (nullable = true)
//  |    |-- name: string (nullable = true)
//  |-- os: string (nullable = true)
//  |-- version: string (nullable = true)

flat.show(false)
// +--------+--------------------+---+----------------+-------+-------+
// |event_id|ts                  |id |profile         |os     |version|
// +--------+--------------------+---+----------------+-------+-------+
// |evt-001 |2026-04-01T10:15:00Z|42 |{Berlin, Alice} |iOS    |17.2   |
// |evt-002 |2026-04-01T10:16:23Z|17 |{Lisbon, Bob}   |Android|14     |
// |evt-003 |2026-04-01T10:18:09Z|91 |{Tokyo, Charlie}|iOS    |17.1   |
// +--------+--------------------+---+----------------+-------+-------+

Two things to notice. First, .* only goes one level deep — user.profile is still a struct in the output. Second, the field names lose their parent context. user.id came out just as id, device.os as os. That's fine here, but it doesn't scale to schemas where two sibling structs share field names. (More on that in the last section.)

To fully flatten with this approach you'd need to call .select(col("profile.*"), ...) again on the result. That works, but you have to know the schema ahead of time. For arbitrary depth, a function is cleaner.
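For the example schema, that second pass looks roughly like this (a sketch; the intermediate names id, profile, os, and version come from the first expansion):

```scala
// first pass: expand the top-level structs (profile survives as a struct)
val step1 = df.select(col("event_id"), col("ts"), col("user.*"), col("device.*"))

// second pass: expand profile, and restore parent context with aliases
val step2 = step1.select(
  col("event_id"),
  col("ts"),
  col("id").as("user_id"),
  col("profile.*"), // city, name
  col("os").as("device_os"),
  col("version").as("device_version")
)
```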

A Recursive Flatten Function

The trick is to walk the schema, build a flat list of Column references, and select them all at once. Each leaf column gets a name derived from its full dotted path, with the dots replaced by underscores so the resulting names are valid identifiers.

import org.apache.spark.sql.Column
import org.apache.spark.sql.types.StructType

def flattenSchema(schema: StructType, prefix: String = ""): Seq[Column] = {
  schema.fields.toSeq.flatMap { field =>
    val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
    field.dataType match {
      case s: StructType => flattenSchema(s, path)
      case _ => Seq(col(path).as(path.replace(".", "_")))
    }
  }
}

val flat = df.select(flattenSchema(df.schema): _*)

flat.printSchema()
// root
//  |-- device_os: string (nullable = true)
//  |-- device_version: string (nullable = true)
//  |-- event_id: string (nullable = true)
//  |-- ts: string (nullable = true)
//  |-- user_id: long (nullable = true)
//  |-- user_profile_city: string (nullable = true)
//  |-- user_profile_name: string (nullable = true)

flat.show(false)
// +---------+--------------+--------+--------------------+-------+-----------------+-----------------+
// |device_os|device_version|event_id|ts                  |user_id|user_profile_city|user_profile_name|
// +---------+--------------+--------+--------------------+-------+-----------------+-----------------+
// |iOS      |17.2          |evt-001 |2026-04-01T10:15:00Z|42     |Berlin           |Alice            |
// |Android  |14            |evt-002 |2026-04-01T10:16:23Z|17     |Lisbon           |Bob              |
// |iOS      |17.1          |evt-003 |2026-04-01T10:18:09Z|91     |Tokyo            |Charlie          |
// +---------+--------------+--------+--------------------+-------+-----------------+-----------------+

How it works:

  • flattenSchema takes a StructType and a prefix representing the path to the current struct ("" at the top level).
  • For each field, it builds the dotted path. If the field is itself a StructType, it recurses with the new prefix. Otherwise it emits a Column reference with the dotted path, aliased to the underscore form.
  • The result is a Seq[Column] you can splat into select with : _*.

Every leaf in user.profile ends up as user_profile_*. The underscore-joined names preserve the full path, which is what makes this approach safe against name collisions.

This function handles arrays of structs by treating them as leaves — col("orders").as("orders") rather than descending into them. That's usually the right default: flattening an array would change the row count, which is a different operation (you'd use explode for that). If your nested data has arrays you need to expand, do that step separately, then call flattenSchema on the result.
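To make that concrete, here's a sketch with a hypothetical orders array bolted onto the example DataFrame — the array is expanded with explode first, and the result is then flattened:

```scala
// hypothetical extension: each event carries an array of order structs
val withOrders = df.withColumn(
  "orders",
  array(struct(lit("o-1").as("sku"), lit(2).as("qty")))
)

// explode emits one row per array element -- this changes the row count
val exploded = withOrders
  .withColumn("order", explode(col("orders")))
  .drop("orders")

// now order is a plain struct column, and flattenSchema applies
val flatOrders = exploded.select(flattenSchema(exploded.schema): _*)
// leaves from the struct come out as order_sku and order_qty
```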

The Name-Collision Trap

Here's the failure mode that bites people who reach for .* without thinking about it. Two sibling structs each have a field called name:

val df = Seq("evt-001", "evt-002").toDF("event_id")
  .withColumn("user", struct(lit("Alice").as("name"), lit(42).as("id")))
  .withColumn("device", struct(lit("iPhone").as("name"), lit("iOS").as("os")))

df.printSchema()
// root
//  |-- event_id: string (nullable = true)
//  |-- user: struct (nullable = false)
//  |    |-- name: string (nullable = false)
//  |    |-- id: integer (nullable = false)
//  |-- device: struct (nullable = false)
//  |    |-- name: string (nullable = false)
//  |    |-- os: string (nullable = false)

val collided = df.select(col("event_id"), col("user.*"), col("device.*"))

collided.printSchema()
// root
//  |-- event_id: string (nullable = true)
//  |-- name: string (nullable = false)
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = false)
//  |-- os: string (nullable = false)

collided.show(false)
// +--------+-----+---+------+---+
// |event_id|name |id |name  |os |
// +--------+-----+---+------+---+
// |evt-001 |Alice|42 |iPhone|iOS|
// |evt-002 |Alice|42 |iPhone|iOS|
// +--------+-----+---+------+---+

Spark didn't raise an error. It produced a DataFrame with two columns both called name. Calling collided.select("name") from here will fail with an ambiguity error, and any downstream code that assumes unique column names will behave unpredictably.
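Because the expansion succeeds silently, a cheap guard after any .* flatten is worth having. The check below is plain collection code over the column-name array — a defensive sketch, not a Spark API:

```scala
// report any column name that appears more than once
def duplicateColumns(names: Seq[String]): Seq[String] =
  names.groupBy(identity).collect {
    case (name, occurrences) if occurrences.size > 1 => name
  }.toSeq

val dupes = duplicateColumns(collided.columns)
// dupes contains "name" for the DataFrame above
```

Failing fast on a non-empty result is usually better than letting the duplicate propagate into a write.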

The recursive function avoids this because it prefixes every leaf with its full path:

val prefixed = df.select(flattenSchema(df.schema): _*)

prefixed.printSchema()
// root
//  |-- event_id: string (nullable = true)
//  |-- user_name: string (nullable = false)
//  |-- user_id: integer (nullable = false)
//  |-- device_name: string (nullable = false)
//  |-- device_os: string (nullable = false)

prefixed.show(false)
// +--------+---------+-------+-----------+---------+
// |event_id|user_name|user_id|device_name|device_os|
// +--------+---------+-------+-----------+---------+
// |evt-001 |Alice    |42     |iPhone     |iOS      |
// |evt-002 |Alice    |42     |iPhone     |iOS      |
// +--------+---------+-------+-----------+---------+

user_name and device_name are unambiguous, and the column origins are still readable.

Which Approach to Use

  • Explicit select(col("a.b.c").as("...")) — when you only need a few of the nested fields, and want full control over the output names. Most production pipelines end up here.
  • col("foo.*") star expansion — quick way to grab everything one level down. Fine for ad-hoc exploration when you know the field names are unique.
  • Recursive flattenSchema — when the structure is deep, varies between datasets, or you need every leaf without enumerating them by hand. Drop this into a utility module and you'll reach for it often.

For the reverse operation — taking flat columns and packing them back into a struct — see Converting a Struct Column to JSON. It uses the same struct() builder pattern shown in the name-collision example above.

Tutorial Details

Created: 2026-05-16 10:26:49 PM

Last Updated: 2026-05-16 10:26:49 PM