Spark Scala Width Bucket

width_bucket assigns a numeric value to an equiwidth histogram bucket given a range and a bucket count. It's the right tool when you need to bin continuous values into fixed-size groups — age brackets, price tiers, score ranges — without writing a chain of when expressions.

width_bucket is a Spark SQL function. It isn't available directly in the org.apache.spark.sql.functions object, so you call it through expr():

def width_bucket(value, min_value, max_value, num_bucket): Column — via expr()

The function divides the range [min_value, max_value) into num_bucket equal-width intervals and returns the 1-based bucket index that contains value. Values below min_value return 0 and values at or above max_value return num_bucket + 1, so the two overflow buckets let you spot outliers without dropping them.

The width_bucket function first appeared in version 3.1.0.

Here's a basic example bucketing ages into six 10-year brackets from 20 to 80:

val df = Seq(
  ("Alice",   23.0),
  ("Bob",     31.0),
  ("Carol",   47.0),
  ("Dave",    58.0),
  ("Eve",     72.0),
  ("Frank",   89.0),
).toDF("name", "age")

val df2 = df
  .withColumn("age_bucket", expr("width_bucket(age, 20, 80, 6)"))

df2.show(false)
// +-----+----+----------+
// |name |age |age_bucket|
// +-----+----+----------+
// |Alice|23.0|1         |
// |Bob  |31.0|2         |
// |Carol|47.0|3         |
// |Dave |58.0|4         |
// |Eve  |72.0|6         |
// |Frank|89.0|7         |
// +-----+----+----------+

Each bucket is 10 years wide: bucket 1 covers [20, 30), bucket 2 covers [30, 40), and so on through bucket 6 which covers [70, 80). Frank is 89 — past the max — so he lands in bucket 7, the overflow bucket. If anyone were under 20, they'd land in bucket 0.

Price Tiers

A common use is grouping products into price tiers for reporting. Pick the range and bucket count to match the buckets you want, and the function handles the arithmetic:

val df = Seq(
  ("Widget",      12.99),
  ("Gadget",      34.50),
  ("Gizmo",       59.95),
  ("Doohickey",   89.00),
  ("Thingamajig", 149.99),
  ("Whatsit",     299.00),
).toDF("product", "price")

val df2 = df
  .withColumn("price_tier", expr("width_bucket(price, 0, 200, 4)"))

df2.show(false)
// +-----------+------+----------+
// |product    |price |price_tier|
// +-----------+------+----------+
// |Widget     |12.99 |1         |
// |Gadget     |34.5  |1         |
// |Gizmo      |59.95 |2         |
// |Doohickey  |89.0  |2         |
// |Thingamajig|149.99|3         |
// |Whatsit    |299.0 |5         |
// +-----------+------+----------+

The four tiers split the [0, 200) range into 50-wide buckets. Whatsit at $299 sits past the top of the range and lands in bucket 5, marking it as an outlier worth a closer look.

Boundary Behavior

The min boundary is inclusive and the max boundary is exclusive, which is worth understanding before you set your range. A value exactly equal to max_value is treated as overflow, not as the top bucket:

val df = Seq(
  ("below",     5.0),
  ("at_min",    10.0),
  ("inside_1",  25.0),
  ("inside_2",  55.0),
  ("at_max",    100.0),
  ("above",     150.0),
).toDF("label", "value")

val df2 = df
  .withColumn("bucket", expr("width_bucket(value, 10, 100, 3)"))

df2.show(false)
// +--------+-----+------+
// |label   |value|bucket|
// +--------+-----+------+
// |below   |5.0  |0     |
// |at_min  |10.0 |1     |
// |inside_1|25.0 |1     |
// |inside_2|55.0 |2     |
// |at_max  |100.0|4     |
// |above   |150.0|4     |
// +--------+-----+------+

at_min (10.0) lands in bucket 1, but at_max (100.0) lands in bucket 4 — the overflow bucket — alongside the value of 150.0. If you want the max value to fall into the top bucket, set max_value slightly above your data's true maximum.

Null Handling

When the input value is null, the result is null. The bounds and bucket count must be non-null, but the value column itself propagates nulls cleanly:

val df = Seq(
  ("Alice", Some(42.0)),
  ("Bob",   None),
  ("Carol", Some(78.0)),
).toDF("name", "score")

val df2 = df
  .withColumn("bucket", expr("width_bucket(score, 0, 100, 5)"))

df2.show(false)
// +-----+-----+------+
// |name |score|bucket|
// +-----+-----+------+
// |Alice|42.0 |3     |
// |Bob  |null |null  |
// |Carol|78.0 |4     |
// +-----+-----+------+

For grouping rows into equal-count buckets (quantiles) within a window rather than equal-width ranges, see ntile. For approximate percentile-based bucketing, see percentile_approx. For hash-based bucketing where values are distributed evenly across a fixed number of buckets, see pmod.

Spark Scala Width Bucket

Price Tiers

Boundary Behavior

Null Handling

Related Functions