Using when Chains vs Map Lookups for Value Mapping in Spark Scala
Value mapping — translating one set of codes into another — comes up constantly in data pipelines. Spark gives you two clean ways to do it: chained when/otherwise expressions and Map lookups with typedLit. Each has strengths, and picking the right one depends on what you're mapping.
The when Chain Approach
The most common approach is chaining when expressions. Each .when() tests a condition and returns a value if it matches. The final .otherwise() provides a default.
val df = Seq(
("Alice", "US"),
("Bob", "GB"),
("Charlie", "DE"),
("Diana", "FR"),
("Eve", "JP"),
).toDF("name", "country_code")
val result = df.withColumn("country_name",
when(col("country_code") === "US", lit("United States"))
.when(col("country_code") === "GB", lit("United Kingdom"))
.when(col("country_code") === "DE", lit("Germany"))
.when(col("country_code") === "FR", lit("France"))
.when(col("country_code") === "JP", lit("Japan"))
.otherwise(lit("Unknown"))
)
result.show(false)
// +-------+------------+--------------+
// |name |country_code|country_name |
// +-------+------------+--------------+
// |Alice |US |United States |
// |Bob |GB |United Kingdom|
// |Charlie|DE |Germany |
// |Diana |FR |France |
// |Eve |JP |Japan |
// +-------+------------+--------------+
This works, and for a handful of values it's perfectly fine. But notice the repetition — every line follows the exact same when(col("country_code") === "X", lit("Y")) pattern. With five entries it's tolerable. With twenty, it's a wall of noise.
The Map Lookup Approach
Instead of writing a when for each value, you can define a Scala Map and pass it into Spark as a literal column using typedLit. Then you look up values by indexing into it.
val df = Seq(
("Alice", "US"),
("Bob", "GB"),
("Charlie", "DE"),
("Diana", "FR"),
("Eve", "JP"),
).toDF("name", "country_code")
val countryMap = typedLit(Map(
"US" -> "United States",
"GB" -> "United Kingdom",
"DE" -> "Germany",
"FR" -> "France",
"JP" -> "Japan",
))
val result = df.withColumn("country_name",
coalesce(countryMap(col("country_code")), lit("Unknown"))
)
result.show(false)
// +-------+------------+--------------+
// |name |country_code|country_name |
// +-------+------------+--------------+
// |Alice |US |United States |
// |Bob |GB |United Kingdom|
// |Charlie|DE |Germany |
// |Diana |FR |France |
// |Eve |JP |Japan |
// +-------+------------+--------------+
Same result, but the mapping is defined as data — a plain Scala Map — rather than as code. typedLit converts the Map into a Spark MapType column literal, and countryMap(col("country_code")) looks up the value. When a key isn't found, the map returns null, so we wrap it in coalesce to provide a default.
Handling Missing Keys and Nulls
When using the map approach, keys that don't exist in the map return null. The same goes for null input values. The coalesce wrapper catches both cases cleanly.
val df = Seq(
("Alice", "US"),
("Bob", "GB"),
("Charlie", "MX"),
("Diana", null),
("Eve", "JP"),
).toDF("name", "country_code")
val countryMap = typedLit(Map(
"US" -> "United States",
"GB" -> "United Kingdom",
"DE" -> "Germany",
"FR" -> "France",
"JP" -> "Japan",
))
val result = df.withColumn("country_name",
coalesce(countryMap(col("country_code")), lit("Unknown"))
)
result.show(false)
// +-------+------------+--------------+
// |name |country_code|country_name |
// +-------+------------+--------------+
// |Alice |US |United States |
// |Bob |GB |United Kingdom|
// |Charlie|MX |Unknown |
// |Diana |null |Unknown |
// |Eve |JP |Japan |
// +-------+------------+--------------+
Charlie's "MX" isn't in the map, so the lookup returns null and coalesce falls through to "Unknown". Diana's null country code does the same. No special handling needed — the pattern just works.
If you're not familiar with how null behaves in Spark comparisons, it's worth understanding. The when chain approach would also need an explicit .otherwise() or .isNull check to handle null inputs correctly.
Reusing a Map Across Multiple Columns
One advantage of the map approach is that you can reuse the same map — or define multiple related maps — without duplicating logic. This is useful when a single source value drives several derived columns.
val df = Seq(
(1, "pending"),
(2, "shipped"),
(3, "delivered"),
(4, "returned"),
(5, "cancelled"),
).toDF("order_id", "status")
val statusLabels = typedLit(Map(
"pending" -> "Order Placed",
"shipped" -> "In Transit",
"delivered" -> "Delivered",
"returned" -> "Return Processing",
"cancelled" -> "Cancelled",
))
val statusPriority = typedLit(Map(
"pending" -> 1,
"shipped" -> 2,
"delivered" -> 3,
"returned" -> 4,
"cancelled" -> 5,
))
val result = df
.withColumn("status_label", statusLabels(col("status")))
.withColumn("sort_order", statusPriority(col("status")))
result.show(false)
// +--------+---------+-----------------+----------+
// |order_id|status |status_label |sort_order|
// +--------+---------+-----------------+----------+
// |1 |pending |Order Placed |1 |
// |2 |shipped |In Transit |2 |
// |3 |delivered|Delivered |3 |
// |4 |returned |Return Processing|4 |
// |5 |cancelled|Cancelled |5 |
// +--------+---------+-----------------+----------+
Doing this with when chains would mean writing two separate chains with the same conditions. The map approach keeps the mapping data in one place and lets you derive as many columns as you need from it.
When to Use when Instead
Maps are great for simple key-to-value lookups, but when chains can do things maps can't — specifically, range-based and conditional logic.
val df = Seq(
("Alice", 95000),
("Bob", 72000),
("Charlie", 145000),
("Diana", 52000),
("Eve", 110000),
).toDF("name", "salary")
val result = df.withColumn("band",
when(col("salary") >= 120000, lit("Senior"))
.when(col("salary") >= 80000, lit("Mid"))
.when(col("salary") >= 60000, lit("Junior"))
.otherwise(lit("Entry"))
)
result.show(false)
// +-------+------+------+
// |name |salary|band |
// +-------+------+------+
// |Alice |95000 |Mid |
// |Bob |72000 |Junior|
// |Charlie|145000|Senior|
// |Diana |52000 |Entry |
// |Eve |110000|Mid |
// +-------+------+------+
You can't express >= comparisons, pattern matching, or multi-column conditions in a map lookup. Whenever your mapping involves anything beyond exact equality on a single column, when is the right tool.
Which to Choose
| Scenario | Approach |
|---|---|
| Exact value-to-value mapping (codes, statuses, labels) | Map lookup |
| More than ~5 values to map | Map lookup |
Range-based conditions (>=, <, between) |
when chain |
| Multi-column conditions | when chain |
| Same mapping reused across multiple derived columns | Map lookup |
| 2-3 simple conditions | Either — when is fine |
The rule of thumb: if you're writing more than a few when(col("x") === "a", lit("b")) lines that all follow the same pattern, a map is cleaner. If your conditions involve anything more complex than exact equality, reach for when.