Converting a Map Column to Individual Columns with getItem in Spark Scala

A MapType column is convenient for carrying a bag of key/value attributes through a pipeline, but most downstream work (filtering, joining, aggregating, writing tabular output) is easier when each key has its own column. This tutorial shows how to pull values out of a map with getItem or the apply() shorthand, what happens when a key is missing, and how to expand a map whose keys you don't know until runtime.

A Map Column to Start With

Here's a small DataFrame where each row carries a Map[String, String] of user attributes. Maps like this often come from JSON ingestion, feature stores, or rows where the schema deliberately keeps an open-ended set of fields.

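// Assumes a SparkSession named `spark` is in scope, as in spark-shell or a notebook.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, explode, map_keys}
import spark.implicits._

val df = Seq(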
  (1, "Alice",   Map("city" -> "Berlin", "country" -> "DE", "tier" -> "gold")),
  (2, "Bob",     Map("city" -> "Lisbon", "country" -> "PT", "tier" -> "silver")),
  (3, "Charlie", Map("city" -> "Tokyo",  "country" -> "JP", "tier" -> "gold")),
).toDF("id", "name", "attrs")

df.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- attrs: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: string (valueContainsNull = true)

df.show(false)
// +---+-------+-----------------------------------------------+
// |id |name   |attrs                                          |
// +---+-------+-----------------------------------------------+
// |1  |Alice  |{city -> Berlin, country -> DE, tier -> gold}  |
// |2  |Bob    |{city -> Lisbon, country -> PT, tier -> silver}|
// |3  |Charlie|{city -> Tokyo, country -> JP, tier -> gold}   |
// +---+-------+-----------------------------------------------+

The whole map renders into one column. To group by tier, filter on country, or write this out as a tidy Parquet file, you want each key promoted to its own column.

For the reverse operation — packing existing columns into a map — see Create Spark Map From Columns.

Extracting Values with getItem

The getItem method on a Column looks up a value by key. On its own, the resulting column gets a generated name such as attrs[city]; adding .as(...) gives it a clean name.

val flat = df.select(
  col("id"),
  col("name"),
  col("attrs").getItem("city").as("city"),
  col("attrs").getItem("country").as("country"),
  col("attrs").getItem("tier").as("tier"),
)

flat.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- city: string (nullable = true)
//  |-- country: string (nullable = true)
//  |-- tier: string (nullable = true)

flat.show(false)
// +---+-------+------+-------+------+
// |id |name   |city  |country|tier  |
// +---+-------+------+-------+------+
// |1  |Alice  |Berlin|DE     |gold  |
// |2  |Bob    |Lisbon|PT     |silver|
// |3  |Charlie|Tokyo |JP     |gold  |
// +---+-------+------+-------+------+

The extracted columns inherit the map's value type — here, string. Notice that every column is marked nullable in the schema, even though the source Map always had a value for every key. That's because getItem returns null whenever the key isn't present in a given row's map, so the type system has to allow it. We'll see that behavior in a moment.
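
To make the effect of the alias and the nullability concrete, here is a minimal sketch. It assumes a throwaway local SparkSession and a one-row stand-in for the sample data (the names sample, unaliased, and aliased are illustrative, not part of the tutorial's pipeline).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("map-getitem-sketch").getOrCreate()
import spark.implicits._

// One-row stand-in for the tutorial's DataFrame.
val sample = Seq((1, Map("city" -> "Berlin", "tier" -> "gold"))).toDF("id", "attrs")

// Without an alias, Spark generates a name from the expression.
val unaliased = sample.select(col("attrs").getItem("city"))
val generatedName = unaliased.columns.head            // "attrs[city]"

// With an alias, the column carries the name you chose.
val aliased = sample.select(col("attrs").getItem("city").as("city"))
val chosenName = aliased.columns.head                 // "city"

// The extracted field is nullable even though this row has the key.
val cityIsNullable = aliased.schema("city").nullable  // true
```

Checking columns and schema like this is cheap: neither call runs a job, so it works as a quick sanity check before a long select.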

The apply() Shorthand

Column also implements apply(), so you can index into the map with parentheses instead of calling getItem by name. The two are equivalent.

val flat = df.select(
  col("id"),
  col("name"),
  col("attrs")("city").as("city"),
  col("attrs")("country").as("country"),
  col("attrs")("tier").as("tier"),
)

flat.show(false)
// +---+-------+------+-------+------+
// |id |name   |city  |country|tier  |
// +---+-------+------+-------+------+
// |1  |Alice  |Berlin|DE     |gold  |
// |2  |Bob    |Lisbon|PT     |silver|
// |3  |Charlie|Tokyo |JP     |gold  |
// +---+-------+------+-------+------+

Pick whichever reads better in context. getItem("city") tends to be clearer when the key is dynamic or stored in a variable; the (...) form is more compact for a hand-written list of literal keys.
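
When the keys live in a variable, a small helper keeps the select readable. The sketch below is one possible shape, not an established API: mapKeyColumns and the wanted list are hypothetical names, and the local session is an assumption for experimentation.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("map-getitem-sketch").getOrCreate()
import spark.implicits._

// Hypothetical helper: one aliased column per requested key.
def mapKeyColumns(mapCol: String, keys: Seq[String]): Seq[Column] =
  keys.map(k => col(mapCol).getItem(k).as(k))

val users = Seq(
  (1, "Alice", Map("city" -> "Berlin", "country" -> "DE", "tier" -> "gold"))
).toDF("id", "name", "attrs")

// Assumption: in a real job this list might come from config or a function argument.
val wanted = Seq("city", "tier")

val picked = users.select((col("id") +: mapKeyColumns("attrs", wanted)): _*)
// picked.columns: Array(id, city, tier)
```

Here getItem(k) reads naturally because k is a value, not a literal; col(mapCol)(k) would behave the same way.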

Missing Keys Become null

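When you ask for a key the map doesn't contain, Spark doesn't raise an error; it returns null. This is convenient for forward-compatibility (new keys can appear without breaking existing pipelines) but easy to miss when you mistype a key name. One caveat: with spark.sql.ansi.enabled set to true, some Spark 3.x releases throw an error on a missing key instead, so confirm the behavior on the version you run.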

val missing = df.select(
  col("id"),
  col("name"),
  col("attrs").getItem("city").as("city"),
  col("attrs").getItem("phone").as("phone"),
)

missing.show(false)
// +---+-------+------+-----+
// |id |name   |city  |phone|
// +---+-------+------+-----+
// |1  |Alice  |Berlin|null |
// |2  |Bob    |Lisbon|null |
// |3  |Charlie|Tokyo |null |
// +---+-------+------+-----+

phone isn't in any of the maps, so every row gets null. If your downstream code compares the result with === or =!=, remember Spark's three-valued logic: a comparison against null evaluates to null, and filter drops those rows. See Why null =!= null Returns null, Not true for the full story. The short version: use isNotNull to skip missing values, not =!= null.
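
A minimal sketch of how a missing key interacts with filters, assuming a local SparkSession and a two-row stand-in dataset in which only one row has a city. The "unknown" default is an arbitrary placeholder chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("map-getitem-sketch").getOrCreate()
import spark.implicits._

val users = Seq(
  (1, "Alice", Map("city" -> "Berlin", "tier" -> "gold")),
  (2, "Bob",   Map("tier" -> "silver"))                    // no "city" key
).toDF("id", "name", "attrs")

val city = col("attrs").getItem("city")

// Keeps only rows where the key was present (and its value non-null): Alice.
val withCity = users.filter(city.isNotNull)

// Looks like "everyone outside Berlin", but Bob's null city makes the
// comparison null, so he is dropped as well and nothing is left.
val notBerlin = users.filter(city =!= "Berlin")

// Substitute a placeholder so later comparisons see a real value.
val defaulted = users.select(col("id"), coalesce(city, lit("unknown")).as("city"))
```

Whether a default is appropriate depends on the data: it removes the null, but it also erases the difference between a missing key and a value that happens to equal the placeholder.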

Discovering the Keys at Runtime

The hand-written list of keys works when you know what's in the map. When the key set isn't known until runtime — for example, you're ingesting a feature dictionary whose contents are determined by config — you can discover the keys first, then build the select from them.

map_keys returns an array of the keys present in each row's map. explode turns that into one row per key, and distinct collapses duplicates across rows. Collecting back to the driver gives you the full key set.

val keys: Array[String] = df
  .select(explode(map_keys(col("attrs"))).as("k"))
  .distinct()
  .as[String]
  .collect()
  .sorted

// keys: Array[String] = Array(city, country, tier)

val keyCols: Seq[Column] = keys.toSeq.map(k => col("attrs").getItem(k).as(k))

val flat = df.select((col("id") +: col("name") +: keyCols): _*)

flat.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
//  |-- city: string (nullable = true)
//  |-- country: string (nullable = true)
//  |-- tier: string (nullable = true)

flat.show(false)
// +---+-------+------+-------+------+
// |id |name   |city  |country|tier  |
// +---+-------+------+-------+------+
// |1  |Alice  |Berlin|DE     |gold  |
// |2  |Bob    |Lisbon|PT     |silver|
// |3  |Charlie|Tokyo |JP     |gold  |
// +---+-------+------+-------+------+

The trade-off is that this triggers an extra Spark job: collect() brings the distinct keys back to the driver before the final select can even be planned. For most use cases (modest key cardinality, one-shot ETL) that's fine. If the key set is very wide, or if the source is a streaming DataFrame, where collect() isn't supported, prefer either a fixed key list pulled from config or storing a struct instead of a map.
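
The fixed-key-list alternative can be sketched as follows. The comma-separated string stands in for a value read from configuration; its name and format are assumptions made for this example, as is the local session.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("map-getitem-sketch").getOrCreate()
import spark.implicits._

val users = Seq(
  (1, "Alice", Map("city" -> "Berlin", "country" -> "DE", "tier" -> "gold")),
  (2, "Bob",   Map("city" -> "Lisbon", "country" -> "PT", "tier" -> "silver"))
).toDF("id", "name", "attrs")

// Hypothetical config value; in practice this might come from a file or job argument.
val configuredKeys = "city, country, tier"

val keys: Seq[String] = configuredKeys.split(",").map(_.trim).filter(_.nonEmpty).toSeq
val keyCols: Seq[Column] = keys.map(k => col("attrs").getItem(k).as(k))

// No collect() here: the select is planned without running a job first.
val flat = users.select((col("id") +: col("name") +: keyCols): _*)
// flat.columns: Array(id, name, city, country, tier)
```

The cost of this approach is that a key missing from the list is silently ignored, and a key in the list that never appears in the data yields an all-null column.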

When Map Keys Vary Between Rows

The discovery pattern shines when not every row has every key. Maps with sparse or row-dependent keys are common in event payloads where different event types carry different fields.

val df = Seq(
  (1, "Alice",   Map("city" -> "Berlin",                    "tier" -> "gold")),
  (2, "Bob",     Map("city" -> "Lisbon", "country" -> "PT", "tier" -> "silver")),
  (3, "Charlie", Map(                    "country" -> "JP", "tier" -> "gold")),
).toDF("id", "name", "attrs")

val keys: Array[String] = df
  .select(explode(map_keys(col("attrs"))).as("k"))
  .distinct()
  .as[String]
  .collect()
  .sorted

val keyCols: Seq[Column] = keys.toSeq.map(k => col("attrs").getItem(k).as(k))

val flat = df.select((col("id") +: col("name") +: keyCols): _*)

flat.show(false)
// +---+-------+------+-------+------+
// |id |name   |city  |country|tier  |
// +---+-------+------+-------+------+
// |1  |Alice  |Berlin|null   |gold  |
// |2  |Bob    |Lisbon|PT     |silver|
// |3  |Charlie|null  |JP     |gold  |
// +---+-------+------+-------+------+

Each row gets a column for every key that appears anywhere in the dataset, with null filling in the gaps. The output is a regular flat DataFrame, ready to write or aggregate.
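
As a sketch of what "ready to aggregate" can look like, the example below rebuilds the sparse dataset, flattens it with the discovery pattern, and then runs two small aggregations. It assumes a local SparkSession; events, coverage, and perTier are illustrative names.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, explode, map_keys}

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("map-getitem-sketch").getOrCreate()
import spark.implicits._

val events = Seq(
  (1, "Alice",   Map("city" -> "Berlin", "tier" -> "gold")),
  (2, "Bob",     Map("city" -> "Lisbon", "country" -> "PT", "tier" -> "silver")),
  (3, "Charlie", Map("country" -> "JP", "tier" -> "gold"))
).toDF("id", "name", "attrs")

// Same discovery pattern as above: distinct keys, collected and sorted.
val keys = events
  .select(explode(map_keys(col("attrs"))).as("k"))
  .distinct().as[String].collect().sorted.toSeq

val flat = events.select(
  (col("id") +: col("name") +: keys.map(k => col("attrs").getItem(k).as(k))): _*)

// count(column) skips nulls, so this reports how many rows carried each key.
val coverage = flat.select(keys.map(k => count(col(k)).as(k)): _*).head()
// city = 2, country = 2, tier = 3

// An ordinary group-by on a promoted key.
val perTier = flat.groupBy("tier").count().orderBy("tier")
// gold -> 2, silver -> 1
```

The coverage row is a quick way to spot keys that are rare in the data before deciding whether they deserve their own column.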

If your records have differing top-level schemas (not just differing map keys), the related problem is reading a JSON file where the schema varies between records — that handles the layer above this one.

Which Approach to Use

  • getItem("key") or col("attrs")("key"): when you know the keys ahead of time. This is the common case and the most readable. Use it for any production pipeline where the key list is a contract.
  • Discover keys with map_keys + explode + collect: when the key set is unknown or varies per dataset. Be aware that it runs an extra Spark job and pulls the key set to the driver before the real query starts.
  • Switch to a struct instead of a map: if the keys really are fixed and well-known. Structs give you a static schema, and a misspelled field name fails at analysis time instead of silently returning null. Maps are the right choice when the key set is genuinely dynamic; structs are better when it isn't.
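
To illustrate the struct option, here is a minimal sketch that repacks two known keys into a struct and shows how a misspelled field behaves. It assumes a local SparkSession; the one-row dataset and the deliberately wrong field name ctiy are made up for the example.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct}

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("map-getitem-sketch").getOrCreate()
import spark.implicits._

val users = Seq((1, "Alice", Map("city" -> "Berlin", "tier" -> "gold"))).toDF("id", "name", "attrs")

// Repack the known keys into a struct with a fixed set of fields.
val asStruct = users.select(
  col("id"),
  struct(
    col("attrs").getItem("city").as("city"),
    col("attrs").getItem("tier").as("tier")
  ).as("attrs")
)

// Struct fields are addressed with dot notation.
val cities = asStruct.select(col("attrs.city"))

// A mistyped struct field is rejected when the query is analyzed...
val typoOnStruct = Try(asStruct.select(col("attrs.ctiy")))

// ...whereas the same typo against the map is accepted and yields a null column.
val typoOnMap = Try(users.select(col("attrs").getItem("ctiy").as("ctiy")))
```

In this sketch typoOnStruct is a Failure and typoOnMap is a Success, which is the practical difference between the two representations when the key list is a contract.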

Tutorial Details

Created: 2026-06-06 10:25:11 PM

Last Updated: 2026-06-06 10:25:11 PM