Unpivoting Columns to Rows with stack in Spark Scala
Wide DataFrames — where each measure lives in its own column — are common in source data but awkward to aggregate, chart, or join. The stack generator function lets you unpivot those columns into rows without leaving the DataFrame API.
The Wide Format Problem
Suppose you have quarterly revenue stored as separate columns.
import spark.implicits._ // brings toDF into scope; spark is your SparkSession

val df = Seq(
("Laptop", 12000, 15000, 13500, 18000),
("Tablet", 8000, 7500, 9200, 11000),
("Monitor", 5000, 6200, 5800, 7100),
).toDF("product", "Q1", "Q2", "Q3", "Q4")
df.show(false)
// +-------+-----+-----+-----+-----+
// |product|Q1 |Q2 |Q3 |Q4 |
// +-------+-----+-----+-----+-----+
// |Laptop |12000|15000|13500|18000|
// |Tablet |8000 |7500 |9200 |11000|
// |Monitor|5000 |6200 |5800 |7100 |
// +-------+-----+-----+-----+-----+
This is fine for reading, but try grouping by quarter or computing a rolling average across quarters — you can't, because the quarter identity is buried in the column names. You need one row per product per quarter.
Unpivoting with stack
The stack function is a SQL generator — it takes a single input row and produces multiple output rows. You call it through selectExpr because it uses SQL syntax.
val result = df.selectExpr(
"product",
"stack(4, 'Q1', Q1, 'Q2', Q2, 'Q3', Q3, 'Q4', Q4) as (quarter, revenue)"
)
result.show(false)
// +-------+-------+-------+
// |product|quarter|revenue|
// +-------+-------+-------+
// |Laptop |Q1 |12000 |
// |Laptop |Q2 |15000 |
// |Laptop |Q3 |13500 |
// |Laptop |Q4 |18000 |
// |Tablet |Q1 |8000 |
// |Tablet |Q2 |7500 |
// |Tablet |Q3 |9200 |
// |Tablet |Q4 |11000 |
// |Monitor|Q1 |5000 |
// |Monitor|Q2 |6200 |
// |Monitor|Q3 |5800 |
// |Monitor|Q4 |7100 |
// +-------+-------+-------+
Three input rows become twelve — one per product per quarter.
How stack Works
The function signature is stack(n, key1, val1, key2, val2, ..., keyN, valN).
- n: the number of rows to generate per input row
- key/value pairs: each pair becomes one output row. The key is typically a string literal (the column name you're unpivoting), and the value is the column reference
The as (quarter, revenue) at the end names the two output columns. Without it, Spark assigns generic names like col0 and col1.
The key insight: stack doesn't care about column names or metadata. It just takes the expressions you give it and deals them out into n rows. The string literals ('Q1', 'Q2', etc.) are how you preserve the identity of which column each value came from.
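One practical consequence: the value expressions that feed the same output column must share a compatible type, or the query fails analysis. When the source columns differ in type, cast them to a common type inside the expression. A sketch against the quarterly DataFrame above:

```scala
// Cast each value to double so the stacked `revenue` column has one
// uniform type, even if the source columns ever differ.
val normalized = df.selectExpr(
  "product",
  "stack(4, 'Q1', cast(Q1 as double), 'Q2', cast(Q2 as double), " +
    "'Q3', cast(Q3 as double), 'Q4', cast(Q4 as double)) as (quarter, revenue)"
)
```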
Handling Nulls After Unpivoting
When your source data has nulls, stack faithfully carries them through. Sometimes that's what you want. Other times you'd rather drop the rows where the value is null — for instance, when unpivoting sparse data where not every subject has a score.
val df = Seq(
("Alice", Some(85), Some(92), None),
("Bob", Some(78), None, Some(88)),
("Charlie", None, Some(95), Some(91)),
).toDF("student", "math", "science", "english")
val unpivoted = df.selectExpr(
"student",
"stack(3, 'math', math, 'science', science, 'english', english) as (subject, score)"
)
unpivoted.show(false)
// +-------+-------+-----+
// |student|subject|score|
// +-------+-------+-----+
// |Alice |math |85 |
// |Alice |science|92 |
// |Alice |english|null |
// |Bob |math |78 |
// |Bob |science|null |
// |Bob |english|88 |
// |Charlie|math |null |
// |Charlie|science|95 |
// |Charlie|english|91 |
// +-------+-------+-----+
Every student gets three rows regardless of whether they have a score. To keep only the rows with actual values, filter after unpivoting.
import org.apache.spark.sql.functions.col

val filtered = unpivoted.filter(col("score").isNotNull)
filtered.show(false)
// +-------+-------+-----+
// |student|subject|score|
// +-------+-------+-----+
// |Alice |math |85 |
// |Alice |science|92 |
// |Bob |math |78 |
// |Bob |english|88 |
// |Charlie|science|95 |
// |Charlie|english|91 |
// +-------+-------+-----+
Note that isNotNull is the right way to make this check. In SQL semantics, any comparison with null evaluates to null rather than true or false, so a filter like col("score") =!= null matches no rows at all.
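A quick way to see the difference, using the unpivoted DataFrame above:

```scala
import org.apache.spark.sql.functions.col

// Comparing against a null literal yields null, which filter treats
// as false -- this keeps no rows at all, not even the null ones:
unpivoted.filter(col("score") =!= null).count() // 0

// IS NOT NULL is a real predicate and keeps the six scored rows:
unpivoted.filter(col("score").isNotNull).count() // 6
```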
Building the Expression Dynamically
Hard-coding every column name into a stack call works when you have a handful of columns. When you have dozens — or the columns to unpivot aren't known until runtime — you can build the expression from a list.
val df = Seq(
("New York", 8336, 8804, 19500),
("Los Angeles", 3979, 3898, 12800),
("Chicago", 2694, 2746, 11500),
).toDF("city", "pop_2020", "pop_2023", "median_income")
val popCols = Seq("pop_2020", "pop_2023")
val stackExpr = popCols
.map(c => s"'$c', $c")
.mkString(s"stack(${popCols.size}, ", ", ", ") as (census_year, population)")
val result = df.selectExpr("city", "median_income", stackExpr)
result.show(false)
// +-----------+-------------+-----------+----------+
// |city |median_income|census_year|population|
// +-----------+-------------+-----------+----------+
// |New York |19500 |pop_2020 |8336 |
// |New York |19500 |pop_2023 |8804 |
// |Los Angeles|12800 |pop_2020 |3979 |
// |Los Angeles|12800 |pop_2023 |3898 |
// |Chicago |11500 |pop_2020 |2694 |
// |Chicago |11500 |pop_2023 |2746 |
// +-----------+-------------+-----------+----------+
The popCols list drives the expression. Only those columns get unpivoted — median_income stays as a regular column. Adding or removing columns from popCols is a one-line change. This pattern also works well when you're reading the column list from config or schema metadata.
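If you do this often, the expression builder is easy to factor into a small helper. The names here are illustrative, not a standard API:

```scala
// Hypothetical helper: build a stack() expression for the given value
// columns. Backticks quote column names that need escaping in SQL.
def stackExpr(valueCols: Seq[String], keyName: String, valueName: String): String = {
  val pairs = valueCols.map(c => s"'$c', `$c`").mkString(", ")
  s"stack(${valueCols.size}, $pairs) as ($keyName, $valueName)"
}

// Usage:
// df.selectExpr("city", "median_income",
//   stackExpr(Seq("pop_2020", "pop_2023"), "census_year", "population"))
```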
Unpivoting for Visualization
A common reason to unpivot is preparing data for charting. Visualization tools typically expect one row per data point with a column that identifies the series.
val df = Seq(
("2024-01", 150, 45, 12),
("2024-02", 175, 52, 18),
("2024-03", 210, 63, 22),
).toDF("month", "page_views", "signups", "purchases")
val result = df.selectExpr(
"month",
"stack(3, 'page_views', page_views, 'signups', signups, 'purchases', purchases) as (metric, value)"
)
result.show(false)
// +-------+----------+-----+
// |month |metric |value|
// +-------+----------+-----+
// |2024-01|page_views|150 |
// |2024-01|signups |45 |
// |2024-01|purchases |12 |
// |2024-02|page_views|175 |
// |2024-02|signups |52 |
// |2024-02|purchases |18 |
// |2024-03|page_views|210 |
// |2024-03|signups |63 |
// |2024-03|purchases |22 |
// +-------+----------+-----+
Now you can group by metric, filter to a single series, or pivot back to wide format if needed. The long format is also easier to aggregate — result.groupBy("metric").sum("value") gives you totals per metric in one line.
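And pivoting back to wide format, should you need it, is one groupBy away:

```scala
// Reverse the unpivot: one column per metric, one row per month.
val wide = result.groupBy("month").pivot("metric").sum("value")
```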
When to Use stack
stack is the right tool when you need to turn columns into rows — the classic unpivot operation. Use it when:
- Your source data has measures spread across columns (quarters, years, categories)
- You need long format for aggregation, joining, or visualization
- You want to keep some columns fixed while unpivoting others
If you need the reverse — turning rows into columns — look at Spark's pivot function on a GroupBy instead.
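One footnote for newer versions: Spark 3.4 added a native unpivot method (with a melt alias) on Dataset, which covers the same ground as stack without SQL strings. A sketch, assuming Spark 3.4+ and the quarterly DataFrame from earlier:

```scala
import org.apache.spark.sql.functions.col

// Equivalent to the stack() example, with column names checked at
// analysis time instead of embedded in a SQL string.
val long = df.unpivot(
  ids = Array(col("product")),
  values = Array(col("Q1"), col("Q2"), col("Q3"), col("Q4")),
  variableColumnName = "quarter",
  valueColumnName = "revenue"
)
```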