Using when Chains vs Map Lookups for Value Mapping in Spark Scala

Value mapping — translating one set of codes into another — comes up constantly in data pipelines. Spark gives you two clean ways to do it: chained when/otherwise expressions and Map lookups with typedLit. Each has strengths, and picking the right one depends on what you're mapping.

The when Chain Approach

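The most common approach is chaining when expressions. Each .when() tests a condition and returns its value if the condition is true; conditions are checked in order, and the first match wins. The final .otherwise() provides a default. Without it, rows that match nothing get null.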

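// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
// These imports are needed by every example in this article.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(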
  ("Alice", "US"),
  ("Bob", "GB"),
  ("Charlie", "DE"),
  ("Diana", "FR"),
  ("Eve", "JP"),
).toDF("name", "country_code")

val result = df.withColumn("country_name",
  when(col("country_code") === "US", lit("United States"))
  .when(col("country_code") === "GB", lit("United Kingdom"))
  .when(col("country_code") === "DE", lit("Germany"))
  .when(col("country_code") === "FR", lit("France"))
  .when(col("country_code") === "JP", lit("Japan"))
  .otherwise(lit("Unknown"))
)

result.show(false)
// +-------+------------+--------------+
// |name   |country_code|country_name  |
// +-------+------------+--------------+
// |Alice  |US          |United States |
// |Bob    |GB          |United Kingdom|
// |Charlie|DE          |Germany       |
// |Diana  |FR          |France        |
// |Eve    |JP          |Japan         |
// +-------+------------+--------------+

This works, and for a handful of values it's perfectly fine. But notice the repetition — every line follows the exact same when(col("country_code") === "X", lit("Y")) pattern. With five entries it's tolerable. With twenty, it's a wall of noise.
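If the mapping has to stay a when chain, the repetition can at least be generated instead of typed out. The sketch below folds a sequence of code/name pairs into one chained expression. The helper name whenChain and its signature are made up for illustration, and it assumes the sequence of pairs is non-empty.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

// In spark-shell this returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("when-chain-fold").getOrCreate()
import spark.implicits._

// Hypothetical helper: builds when(...).when(...).otherwise(default) from pairs.
// Assumes `pairs` is non-empty (pairs.head throws on an empty sequence).
def whenChain(source: Column, pairs: Seq[(String, String)], default: String): Column = {
  val (firstKey, firstValue) = pairs.head
  val first = when(source === firstKey, lit(firstValue))
  val chained = pairs.tail.foldLeft(first) { case (chain, (key, value)) =>
    chain.when(source === key, lit(value))
  }
  chained.otherwise(lit(default))
}

val countryPairs = Seq(
  "US" -> "United States",
  "GB" -> "United Kingdom",
  "DE" -> "Germany"
)

val people = Seq(("Alice", "US"), ("Bob", "GB"), ("Charlie", "MX"))
  .toDF("name", "country_code")

val named = people.withColumn(
  "country_name",
  whenChain(col("country_code"), countryPairs, "Unknown")
)
named.show(false)
```

This only saves typing: the resulting column is still a chain of equality checks evaluated in order, so it keeps the first-match-wins behavior of a hand-written chain.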

The Map Lookup Approach

Instead of writing a when for each value, you can define a Scala Map and pass it into Spark as a literal column using typedLit. Then you look up values by indexing into it.

val df = Seq(
  ("Alice", "US"),
  ("Bob", "GB"),
  ("Charlie", "DE"),
  ("Diana", "FR"),
  ("Eve", "JP"),
).toDF("name", "country_code")

val countryMap = typedLit(Map(
  "US" -> "United States",
  "GB" -> "United Kingdom",
  "DE" -> "Germany",
  "FR" -> "France",
  "JP" -> "Japan",
))

val result = df.withColumn("country_name",
  coalesce(countryMap(col("country_code")), lit("Unknown"))
)

result.show(false)
// +-------+------------+--------------+
// |name   |country_code|country_name  |
// +-------+------------+--------------+
// |Alice  |US          |United States |
// |Bob    |GB          |United Kingdom|
// |Charlie|DE          |Germany       |
// |Diana  |FR          |France        |
// |Eve    |JP          |Japan         |
// +-------+------------+--------------+

Same result, but the mapping is defined as data (a plain Scala Map) rather than as code. typedLit converts the Map into a Spark MapType column literal, and countryMap(col("country_code")) looks up the value for each row. When a key isn't found, the lookup returns null under Spark's default settings (some Spark 3.x releases raise an error instead when ANSI mode is enabled), so we wrap it in coalesce to provide a default.
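To see what typedLit produces, you can project the literal onto a DataFrame and inspect the resulting schema. This is a small sketch; the alias m and the one-row spark.range(1) DataFrame are scaffolding chosen for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.typedLit

// In spark-shell this returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("typedlit-schema").getOrCreate()

val countryMap = typedLit(Map("US" -> "United States", "GB" -> "United Kingdom"))

// Project the literal onto a one-row DataFrame and look at its data type.
val mapType = spark.range(1).select(countryMap.as("m")).schema("m").dataType
println(mapType.simpleString) // map<string,string>
```

The key and value types come from the Scala Map's type parameters, which is why the second map in a later example (String to Int) yields an integer column without any cast.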

Handling Missing Keys and Nulls

When using the map approach, keys that don't exist in the map return null. The same goes for null input values. The coalesce wrapper catches both cases cleanly.

val df = Seq(
  ("Alice", "US"),
  ("Bob", "GB"),
  ("Charlie", "MX"),
  ("Diana", null),
  ("Eve", "JP"),
).toDF("name", "country_code")

val countryMap = typedLit(Map(
  "US" -> "United States",
  "GB" -> "United Kingdom",
  "DE" -> "Germany",
  "FR" -> "France",
  "JP" -> "Japan",
))

val result = df.withColumn("country_name",
  coalesce(countryMap(col("country_code")), lit("Unknown"))
)

result.show(false)
// +-------+------------+--------------+
// |name   |country_code|country_name  |
// +-------+------------+--------------+
// |Alice  |US          |United States |
// |Bob    |GB          |United Kingdom|
// |Charlie|MX          |Unknown       |
// |Diana  |null        |Unknown       |
// |Eve    |JP          |Japan         |
// +-------+------------+--------------+

Charlie's "MX" isn't in the map, so the lookup returns null and coalesce falls through to "Unknown". Diana's null country code does the same. No special handling needed — the pattern just works.

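If you're not familiar with how null behaves in Spark comparisons, it's worth understanding. In a when chain, a null input makes every === comparison evaluate to null, which when treats as no match, so the row falls through to .otherwise(), or to null if the chain has no .otherwise(). To label null inputs differently from unmatched values, add an explicit .isNull branch.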
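When null inputs should be reported separately from unrecognized codes, both approaches need one extra branch. A minimal sketch follows; the label "Missing" is an arbitrary choice for illustration, and the map is trimmed to two entries to keep the example short.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit, typedLit, when}

// In spark-shell this returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("null-vs-unknown").getOrCreate()
import spark.implicits._

val people = Seq(("Alice", "US"), ("Charlie", "MX"), ("Diana", null))
  .toDF("name", "country_code")

val countryMap = typedLit(Map("US" -> "United States", "GB" -> "United Kingdom"))

// Map lookup: test for a null input first, then fall back for unknown keys.
val viaMap = people.withColumn("country_name",
  when(col("country_code").isNull, lit("Missing"))
    .otherwise(coalesce(countryMap(col("country_code")), lit("Unknown")))
)

// when chain: the isNull branch must be spelled out too. Without it, a null
// input falls through to otherwise() like any other unmatched value.
val viaWhen = people.withColumn("country_name",
  when(col("country_code").isNull, lit("Missing"))
    .when(col("country_code") === "US", lit("United States"))
    .when(col("country_code") === "GB", lit("United Kingdom"))
    .otherwise(lit("Unknown"))
)

viaMap.show(false)
viaWhen.show(false)
```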

Reusing a Map Across Multiple Columns

One advantage of the map approach is that you can reuse the same map — or define multiple related maps — without duplicating logic. This is useful when a single source value drives several derived columns.

val df = Seq(
  (1, "pending"),
  (2, "shipped"),
  (3, "delivered"),
  (4, "returned"),
  (5, "cancelled"),
).toDF("order_id", "status")

val statusLabels = typedLit(Map(
  "pending"   -> "Order Placed",
  "shipped"   -> "In Transit",
  "delivered" -> "Delivered",
  "returned"  -> "Return Processing",
  "cancelled" -> "Cancelled",
))

val statusPriority = typedLit(Map(
  "pending"   -> 1,
  "shipped"   -> 2,
  "delivered" -> 3,
  "returned"  -> 4,
  "cancelled" -> 5,
))

val result = df
  .withColumn("status_label", statusLabels(col("status")))
  .withColumn("sort_order", statusPriority(col("status")))

result.show(false)
// +--------+---------+-----------------+----------+
// |order_id|status   |status_label     |sort_order|
// +--------+---------+-----------------+----------+
// |1       |pending  |Order Placed     |1         |
// |2       |shipped  |In Transit       |2         |
// |3       |delivered|Delivered        |3         |
// |4       |returned |Return Processing|4         |
// |5       |cancelled|Cancelled        |5         |
// +--------+---------+-----------------+----------+

Doing this with when chains would mean writing two separate chains that repeat the same conditions. With maps, each mapping is a compact piece of data, and you can derive as many columns as you need from the same source column by adding another map.
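The two maps in the example still repeat their keys. One way to avoid that, sketched here under the assumption that the mapping lives in your Scala code, is to keep a single sequence of rows and derive each map from it. The name statusInfo and the three-row data set are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, typedLit}

// In spark-shell this returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("derived-maps").getOrCreate()
import spark.implicits._

// Hypothetical single source of truth: (status, label, sort order).
val statusInfo = Seq(
  ("pending", "Order Placed", 1),
  ("shipped", "In Transit", 2),
  ("delivered", "Delivered", 3)
)

// Derive one map literal per output column from the same rows.
val statusLabels = typedLit(statusInfo.map { case (status, label, _) => status -> label }.toMap)
val statusPriority = typedLit(statusInfo.map { case (status, _, order) => status -> order }.toMap)

val orders = Seq((1, "pending"), (2, "shipped"), (3, "delivered")).toDF("order_id", "status")

val labelled = orders
  .withColumn("status_label", statusLabels(col("status")))
  .withColumn("sort_order", statusPriority(col("status")))
labelled.show(false)
```

Adding a new status is then a one-line change to statusInfo, and both derived columns pick it up.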

When to Use when Instead

Maps are great for simple key-to-value lookups, but when chains can do things maps can't — specifically, range-based and conditional logic.

val df = Seq(
  ("Alice", 95000),
  ("Bob", 72000),
  ("Charlie", 145000),
  ("Diana", 52000),
  ("Eve", 110000),
).toDF("name", "salary")

val result = df.withColumn("band",
  when(col("salary") >= 120000, lit("Senior"))
  .when(col("salary") >= 80000, lit("Mid"))
  .when(col("salary") >= 60000, lit("Junior"))
  .otherwise(lit("Entry"))
)

result.show(false)
// +-------+------+------+
// |name   |salary|band  |
// +-------+------+------+
// |Alice  |95000 |Mid   |
// |Bob    |72000 |Junior|
// |Charlie|145000|Senior|
// |Diana  |52000 |Entry |
// |Eve    |110000|Mid   |
// +-------+------+------+

You can't express >= comparisons, pattern matching, or multi-column conditions in a map lookup. Whenever your mapping involves anything beyond exact equality on a single column, when is the right tool.
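As an illustration of conditions that go beyond a single-column equality check, the sketch below combines two columns in one branch and uses a prefix match in another. The department column, the review_track labels, and the thresholds are invented for the example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, when}

// In spark-shell this returns the session that already exists.
val spark = SparkSession.builder().master("local[*]").appName("multi-column-when").getOrCreate()
import spark.implicits._

val staff = Seq(
  ("Alice", "Engineering", 95000),
  ("Bob", "Sales", 72000),
  ("Charlie", "Engineering", 145000)
).toDF("name", "department", "salary")

val tracked = staff.withColumn("review_track",
  // Multi-column condition: department and salary together.
  when(col("department") === "Engineering" && col("salary") >= 120000, lit("Principal"))
    // Prefix match on a string column; reached only if the branch above did not match.
    .when(col("department").startsWith("Eng"), lit("Technical"))
    .otherwise(lit("General"))
)
tracked.show(false)
```

Branch order matters here: Charlie satisfies both conditions, and gets "Principal" only because that branch comes first.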

Which to Choose

Scenario	Approach
Exact value-to-value mapping (codes, statuses, labels)	Map lookup
More than ~5 exact-match values to map	Map lookup
Range-based conditions (`>=`, `<`, `between`)	`when` chain
Multi-column conditions	`when` chain
Same mapping reused across multiple derived columns	Map lookup
2-3 simple conditions	Either — `when` is fine

The rule of thumb: if you're writing more than a few when(col("x") === "a", lit("b")) lines that all follow the same pattern, a map is cleaner. If your conditions involve anything more complex than exact equality, reach for when.