Spark Scala Width Bucket
width_bucket assigns a numeric value to an equiwidth histogram bucket given a range and a bucket count. It's the right tool when you need to bin continuous values into fixed-size groups — age brackets, price tiers, score ranges — without writing a chain of when expressions.
width_bucket is a Spark SQL function. It isn't available directly in the org.apache.spark.sql.functions object, so you call it through expr():
def width_bucket(value, min_value, max_value, num_bucket): Column — via expr()
The function divides the range [min_value, max_value) into num_bucket equal-width intervals and returns the 1-based bucket index that contains value. Values below min_value return 0 and values at or above max_value return num_bucket + 1, so the two overflow buckets let you spot outliers without dropping them.
The width_bucket function first appeared in version 3.1.0.
Here's a basic example bucketing ages into six 10-year brackets from 20 to 80:
val df = Seq(
("Alice", 23.0),
("Bob", 31.0),
("Carol", 47.0),
("Dave", 58.0),
("Eve", 72.0),
("Frank", 89.0),
).toDF("name", "age")
val df2 = df
.withColumn("age_bucket", expr("width_bucket(age, 20, 80, 6)"))
df2.show(false)
// +-----+----+----------+
// |name |age |age_bucket|
// +-----+----+----------+
// |Alice|23.0|1 |
// |Bob |31.0|2 |
// |Carol|47.0|3 |
// |Dave |58.0|4 |
// |Eve |72.0|6 |
// |Frank|89.0|7 |
// +-----+----+----------+
Each bucket is 10 years wide: bucket 1 covers [20, 30), bucket 2 covers [30, 40), and so on through bucket 6 which covers [70, 80). Frank is 89 — past the max — so he lands in bucket 7, the overflow bucket. If anyone were under 20, they'd land in bucket 0.
Price Tiers
A common use is grouping products into price tiers for reporting. Pick the range and bucket count to match the buckets you want, and the function handles the arithmetic:
val df = Seq(
("Widget", 12.99),
("Gadget", 34.50),
("Gizmo", 59.95),
("Doohickey", 89.00),
("Thingamajig", 149.99),
("Whatsit", 299.00),
).toDF("product", "price")
val df2 = df
.withColumn("price_tier", expr("width_bucket(price, 0, 200, 4)"))
df2.show(false)
// +-----------+------+----------+
// |product |price |price_tier|
// +-----------+------+----------+
// |Widget |12.99 |1 |
// |Gadget |34.5 |1 |
// |Gizmo |59.95 |2 |
// |Doohickey |89.0 |2 |
// |Thingamajig|149.99|3 |
// |Whatsit |299.0 |5 |
// +-----------+------+----------+
The four tiers split the [0, 200) range into 50-wide buckets. Whatsit at $299 sits past the top of the range and lands in bucket 5, marking it as an outlier worth a closer look.
Boundary Behavior
The min boundary is inclusive and the max boundary is exclusive, which is worth understanding before you set your range. A value exactly equal to max_value is treated as overflow, not as the top bucket:
val df = Seq(
("below", 5.0),
("at_min", 10.0),
("inside_1", 25.0),
("inside_2", 55.0),
("at_max", 100.0),
("above", 150.0),
).toDF("label", "value")
val df2 = df
.withColumn("bucket", expr("width_bucket(value, 10, 100, 3)"))
df2.show(false)
// +--------+-----+------+
// |label |value|bucket|
// +--------+-----+------+
// |below |5.0 |0 |
// |at_min |10.0 |1 |
// |inside_1|25.0 |1 |
// |inside_2|55.0 |2 |
// |at_max |100.0|4 |
// |above |150.0|4 |
// +--------+-----+------+
at_min (10.0) lands in bucket 1, but at_max (100.0) lands in bucket 4 — the overflow bucket — alongside the value of 150.0. If you want the max value to fall into the top bucket, set max_value slightly above your data's true maximum.
Null Handling
When the input value is null, the result is null. The bounds and bucket count must be non-null, but the value column itself propagates nulls cleanly:
val df = Seq(
("Alice", Some(42.0)),
("Bob", None),
("Carol", Some(78.0)),
).toDF("name", "score")
val df2 = df
.withColumn("bucket", expr("width_bucket(score, 0, 100, 5)"))
df2.show(false)
// +-----+-----+------+
// |name |score|bucket|
// +-----+-----+------+
// |Alice|42.0 |3 |
// |Bob |null |null |
// |Carol|78.0 |4 |
// +-----+-----+------+
Related Functions
For grouping rows into equal-count buckets (quantiles) within a window rather than equal-width ranges, see ntile. For approximate percentile-based bucketing, see percentile_approx. For hash-based bucketing where values are distributed evenly across a fixed number of buckets, see pmod.