Adding a Row Number Without a Natural Ordering Column in Spark Scala
Spark DataFrames have no built-in row order, so adding a row number isn't as simple as it is in SQL or pandas. When you don't have a column to sort by, you need a strategy to manufacture one. This tutorial walks through the options and their trade-offs.
The Problem
Window functions like row_number() require an orderBy clause. If your data doesn't have a natural ordering column — no timestamp, no primary key, no sequence number — you need something to order by.
A tempting first move is to order by a constant.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{lit, row_number}
import spark.implicits._  // assumes an active SparkSession named `spark`

val df = Seq(
  ("Alice", "Engineering", 95000),
  ("Bob", "Marketing", 72000),
  ("Charlie", "Engineering", 110000),
  ("Diana", "Marketing", 68000),
  ("Eve", "Sales", 85000)
).toDF("name", "department", "salary")

val result = df.withColumn("row_num",
  row_number().over(Window.orderBy(lit(1)))
)
result.show(false)
// +-------+-----------+------+-------+
// |name |department |salary|row_num|
// +-------+-----------+------+-------+
// |Alice |Engineering|95000 |1 |
// |Bob |Marketing |72000 |2 |
// |Charlie|Engineering|110000|3 |
// |Diana |Marketing |68000 |4 |
// |Eve |Sales |85000 |5 |
// +-------+-----------+------+-------+
This works, but it's fragile. Ordering by a constant means every row has the same sort key, so Spark can return them in any order. The numbers you get today might be different tomorrow — or different on a cluster versus local mode. The result is non-deterministic.
It also triggers a warning: No Partition Defined for Window operation! Moving all data to a single partition. That's Spark telling you it has to shuffle all data to one executor, which defeats parallelism on large datasets.
Using monotonically_increasing_id
Spark provides monotonically_increasing_id() — a built-in function that assigns a unique, monotonically increasing 64-bit integer to each row. It doesn't require a window or shuffle.
import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._  // assumes an active SparkSession named `spark`

val df = Seq(
  ("Alice", "Engineering", 95000),
  ("Bob", "Marketing", 72000),
  ("Charlie", "Engineering", 110000),
  ("Diana", "Marketing", 68000),
  ("Eve", "Sales", 85000)
).toDF("name", "department", "salary")

val result = df.withColumn("id", monotonically_increasing_id())
result.show(false)
// +-------+-----------+------+---+
// |name |department |salary|id |
// +-------+-----------+------+---+
// |Alice |Engineering|95000 |0 |
// |Bob |Marketing |72000 |1 |
// |Charlie|Engineering|110000|2 |
// |Diana |Marketing |68000 |3 |
// |Eve |Sales |85000 |4 |
// +-------+-----------+------+---+
The IDs are unique and monotonically increasing, but they are not guaranteed to be consecutive. In the current implementation, each 64-bit ID packs the partition number into the upper 31 bits and the record number within that partition into the lower 33 bits. On a multi-partition dataset you might see values like 0, 1, 2, 8589934592, 8589934593, because each partition starts at its own offset. This is by design and is documented in the Spark API.
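The offset behavior is easy to see with plain arithmetic: per the documented layout, an ID is the partition number shifted left by 33 bits plus the record number, so partition 1 begins at 2^33 = 8589934592. A minimal Spark-free sketch of that layout (monoId is a hypothetical helper for illustration, not a Spark function):

```scala
// Reproduce the documented layout of monotonically_increasing_id():
// upper 31 bits = partition number, lower 33 bits = record number.
def monoId(partitionId: Int, recordNumber: Long): Long =
  (partitionId.toLong << 33) + recordNumber

// Partition 0 starts at 0; partition 1 starts at 2^33 = 8589934592.
val ids = Seq(
  monoId(0, 0), monoId(0, 1), monoId(0, 2),
  monoId(1, 0), monoId(1, 1)
)
println(ids.mkString(", "))
// 0, 1, 2, 8589934592, 8589934593
```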
If all you need is a unique identifier and don't care about gaps, monotonically_increasing_id() is the simplest and most efficient option — no shuffle, no window.
Getting Sequential 1-Based Row Numbers
When you need consecutive numbers starting from 1 — for display, pagination, or downstream systems that expect dense sequences — combine monotonically_increasing_id() with row_number().
val df = Seq(
  ("Alice", "Engineering", 95000),
  ("Bob", "Marketing", 72000),
  ("Charlie", "Engineering", 110000),
  ("Diana", "Marketing", 68000),
  ("Eve", "Sales", 85000)
).toDF("name", "department", "salary")

val result = df
  .withColumn("_mono_id", monotonically_increasing_id())
  .withColumn("row_num", row_number().over(Window.orderBy("_mono_id")))
  .drop("_mono_id")
result.show(false)
// +-------+-----------+------+-------+
// |name |department |salary|row_num|
// +-------+-----------+------+-------+
// |Alice |Engineering|95000 |1 |
// |Bob |Marketing |72000 |2 |
// |Charlie|Engineering|110000|3 |
// |Diana |Marketing |68000 |4 |
// |Eve |Sales |85000 |5 |
// +-------+-----------+------+-------+
The steps: add monotonically_increasing_id() as a temporary column, use it as the orderBy in a window-based row_number(), then drop the temporary column. The result is a clean, sequential 1-based index.
This approach still triggers the "no partition defined" warning and shuffles everything to one partition for the window operation. That's unavoidable when you need a global sequential number — Spark has to see all the rows in one place to assign consecutive values.
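If that single-partition window becomes a bottleneck, one workaround at the RDD level is zipWithIndex, which assigns consecutive 0-based indices without collapsing the data to one partition (it runs a small extra Spark job to compute per-partition offsets instead). A sketch, assuming the same df and an active SparkSession named spark:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField}

// zipWithIndex numbers rows consecutively in partition order, trading the
// single-partition shuffle for one extra job that counts rows per partition.
val indexed = df.rdd.zipWithIndex.map { case (row, idx) =>
  Row.fromSeq(row.toSeq :+ (idx + 1))  // shift to a 1-based row number
}
val schema = df.schema.add(StructField("row_num", LongType, nullable = false))
val result = spark.createDataFrame(indexed, schema)
```

The trade-off is a round trip through the RDD API, which loses some Catalyst optimizations, so reserve it for cases where the window shuffle is genuinely the bottleneck.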
You can also inline monotonically_increasing_id() directly in the orderBy to skip the temporary column.
val df = Seq(
  ("Alice", "Engineering", 95000),
  ("Bob", "Marketing", 72000),
  ("Charlie", "Engineering", 110000),
  ("Diana", "Marketing", 68000),
  ("Eve", "Sales", 85000)
).toDF("name", "department", "salary")

val result = df.withColumn("row_num",
  row_number().over(Window.orderBy(monotonically_increasing_id()))
)
result.show(false)
// +-------+-----------+------+-------+
// |name |department |salary|row_num|
// +-------+-----------+------+-------+
// |Alice |Engineering|95000 |1 |
// |Bob |Marketing |72000 |2 |
// |Charlie|Engineering|110000|3 |
// |Diana |Marketing |68000 |4 |
// |Eve |Sales |85000 |5 |
// +-------+-----------+------+-------+
Same result, fewer lines. Either style works — use whichever reads better in your codebase.
Partitioned Row Numbers
If you need row numbers within groups — numbering employees within each department, for instance — add a partitionBy to the window. This is where the approach scales well: partitionBy distributes the window computation across executors instead of funneling every row to a single one.
val df = Seq(
  ("Alice", "Engineering", 95000),
  ("Bob", "Marketing", 72000),
  ("Charlie", "Engineering", 110000),
  ("Diana", "Marketing", 68000),
  ("Eve", "Sales", 85000)
).toDF("name", "department", "salary")

val result = df
  .withColumn("_mono_id", monotonically_increasing_id())
  .withColumn("row_num",
    row_number().over(Window.partitionBy("department").orderBy("_mono_id"))
  )
  .drop("_mono_id")
result.show(false)
// +-------+-----------+------+-------+
// |name |department |salary|row_num|
// +-------+-----------+------+-------+
// |Alice |Engineering|95000 |1 |
// |Charlie|Engineering|110000|2 |
// |Bob |Marketing |72000 |1 |
// |Diana |Marketing |68000 |2 |
// |Eve |Sales |85000 |1 |
// +-------+-----------+------+-------+
Each department gets its own numbering starting from 1. The partitionBy means Spark only needs to collect rows within each group, not the entire dataset. This scales much better than a global row number.
Note that the output rows are now grouped by department — that's a side effect of the window partitioning. The row numbers within each group reflect the original insertion order (via monotonically_increasing_id), not any meaningful sort.
Which Approach to Choose
| Scenario | Approach |
|---|---|
| Need a unique ID, gaps are OK | monotonically_increasing_id() alone |
| Need sequential 1-based numbers globally | row_number() over monotonically_increasing_id() |
| Need sequential numbers within groups | row_number() with partitionBy + orderBy(monotonically_increasing_id()) |
| Data has a natural ordering column | Use it directly — row_number().over(Window.orderBy("timestamp")) |
The key trade-off is between simplicity and performance. monotonically_increasing_id() alone is cheap: no shuffle, no window. Adding row_number() on top introduces a shuffle, which matters at scale. If your dataset fits comfortably on a single executor and you just need clean row numbers, the shuffle cost is negligible. If you're working with billions of rows, prefer monotonically_increasing_id() on its own, or partition the window so the numbering work is distributed; a global row_number() funnels everything through one partition even when a natural ordering column exists.
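To round out the last row of the table: when the data does carry a natural ordering column, using it directly keeps the numbering deterministic with no synthetic ID. A sketch reusing the sample df, ranking employees within each department by descending salary:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Deterministic up to ties: the ordering comes from the data itself,
// so row numbers are stable across runs and cluster layouts.
val bySalary = Window.partitionBy("department").orderBy(col("salary").desc)
val ranked = df.withColumn("row_num", row_number().over(bySalary))
ranked.show(false)
```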