Spark Scala Pmod

The pmod function returns the positive modulus of dividing one column by another: the remainder, adjusted so that it is non-negative whenever the divisor is positive. The standard % operator, by contrast, gives the remainder the sign of the dividend. That makes pmod the right tool for hash bucketing and any cyclic indexing where a negative remainder would point you at the wrong bucket.

def pmod(dividend: Column, divisor: Column): Column

The function returns the same numeric type as its inputs. Compare it side by side with the regular % operator to see where the two diverge:

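import org.apache.spark.sql.functions.{col, pmod}
import spark.implicits._ // assumes a SparkSession named spark, as in spark-shell

val df = Seq(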
  (10, 3),
  (-10, 3),
  (10, -3),
  (-10, -3),
  (7, 7),
  (0, 5),
).toDF("dividend", "divisor")

val df2 = df
  .withColumn("mod", col("dividend") % col("divisor"))
  .withColumn("pmod", pmod(col("dividend"), col("divisor")))

df2.show(false)
// +--------+-------+---+----+
// |dividend|divisor|mod|pmod|
// +--------+-------+---+----+
// |10      |3      |1  |1   |
// |-10     |3      |-1 |2   |
// |10      |-3     |1  |1   |
// |-10     |-3     |-1 |-1  |
// |7       |7      |0  |0   |
// |0       |5      |0  |0   |
// +--------+-------+---+----+

The interesting row is -10 mod 3. The % operator returns -1 because Spark, like Java, gives the remainder the sign of the dividend. pmod instead returns 2, which is what you get from ((-10 % 3) + 3) % 3. Note that "positive" only holds when the divisor is positive: pmod(-10, -3) is still -1. In practice you'll almost always pass a positive divisor, so the name fits the typical use case.
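The arithmetic described above can be sketched in plain Scala, without Spark. The helper name pmodRef is hypothetical and exists only for illustration. It is meant to reproduce the values in the table, not to stand in for Spark's own implementation.

```scala
// Plain-Scala sketch of the pmod arithmetic; pmodRef is a hypothetical helper, not a Spark API.
object PmodSketch {
  def pmodRef(a: Int, n: Int): Int = {
    val r = a % n            // truncated remainder: the sign follows the dividend
    if (r < 0) (r + n) % n   // shift a negative remainder by the divisor
    else r
  }

  def main(args: Array[String]): Unit = {
    // Same (dividend, divisor) pairs as the table above
    val inputs = Seq((10, 3), (-10, 3), (10, -3), (-10, -3), (7, 7), (0, 5))
    inputs.foreach { case (a, n) =>
      println(s"$a % $n = ${a % n}, pmodRef = ${pmodRef(a, n)}")
    }
  }
}
```

For positive divisors this gives the same result as java.lang.Math.floorMod. The two differ when the divisor is negative: Math.floorMod(10, -3) is -2, while pmod returns 1 for the same inputs, as the table shows.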

Hash Bucketing

The most common reason to reach for pmod is bucketing records by a hash. Spark's hash function can return negative integers, and value % numBuckets on a negative input can produce a negative bucket, which is an out-of-range index waiting to happen. pmod fixes this:

import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val df = Seq(
  ("order-1001", "alice"),
  ("order-1002", "bob"),
  ("order-1003", "carol"),
  ("order-1004", "dave"),
  ("order-1005", "eve"),
  ("order-1006", "frank"),
).toDF("order_id", "customer")

val df2 = df
  .withColumn("bucket", pmod(hash(col("order_id")), lit(4)))

df2.show(false)
// +----------+--------+------+
// |order_id  |customer|bucket|
// +----------+--------+------+
// |order-1001|alice   |0     |
// |order-1002|bob     |2     |
// |order-1003|carol   |3     |
// |order-1004|dave    |3     |
// |order-1005|eve     |2     |
// |order-1006|frank   |2     |
// +----------+--------+------+

Every bucket value falls in the [0, 4) range, so it's safe to use directly as an array index, a partition key, or a routing target. Replace lit(4) with whatever bucket count you need.
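One way to check the range claim on your own data is to compute the raw % remainder and the pmod bucket side by side and count any rows that fall outside [0, numBuckets). The sketch below assumes a local SparkSession and synthetic order IDs. The object name BucketRangeCheck, the helper withBuckets, and the raw_mod column are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

object BucketRangeCheck {
  // Adds the raw remainder and the pmod bucket side by side (hypothetical helper).
  def withBuckets(df: DataFrame, keyCol: String, numBuckets: Int): DataFrame =
    df.withColumn("raw_mod", hash(col(keyCol)) % lit(numBuckets))
      .withColumn("bucket", pmod(hash(col(keyCol)), lit(numBuckets)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")              // assumption: local run for illustration
      .appName("pmod-bucket-check")
      .getOrCreate()
    import spark.implicits._

    val numBuckets = 4
    // Synthetic keys, not real data
    val orders = (1001 to 1020).map(i => s"order-$i").toDF("order_id")
    val bucketed = withBuckets(orders, "order_id", numBuckets)

    // raw_mod may be negative; bucket should stay inside [0, numBuckets)
    val outOfRange = bucketed
      .filter(col("bucket") < 0 || col("bucket") >= numBuckets)
      .count()
    println(s"rows with bucket outside [0, $numBuckets): $outOfRange")

    // Rough look at how evenly the keys spread across buckets
    bucketed.groupBy("bucket").count().orderBy("bucket").show(false)

    spark.stop()
  }
}
```

The out-of-range count should be zero for any positive numBuckets. The per-bucket counts depend on the keys, so no particular distribution is shown here.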

Division by Zero and Nulls

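pmod returns null when the divisor is zero rather than throwing an error, which matches the % operator. This holds when ANSI mode is off (spark.sql.ansi.enabled=false), which was the default before Spark 4.0; with ANSI mode on, a zero divisor raises a division-by-zero error instead. Nulls on either input also propagate as null. The example below covers the zero-divisor case: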

val df = Seq(
  (5, 3),
  (10, 0),
  (-7, 4),
).toDF("dividend", "divisor")

val df2 = df
  .withColumn("pmod", pmod(col("dividend"), col("divisor")))

df2.show(false)
// +--------+-------+----+
// |dividend|divisor|pmod|
// +--------+-------+----+
// |5       |3      |2   |
// |10      |0      |null|
// |-7      |4      |1   |
// +--------+-------+----+

The pmod(-7, 4) = 1 row is worth noting. Standard % would give -3; pmod adds the divisor back to land in [0, 4).
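The table above exercises the zero divisor but not null inputs. A minimal sketch of the null case follows, assuming a local SparkSession. The object name PmodNulls and the helper sample are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, pmod}

object PmodNulls {
  // Builds three rows: no nulls, a null dividend, a null divisor.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    // Option[Int] fields become nullable integer columns
    Seq[(Option[Int], Option[Int])](
      (Some(5), Some(3)),
      (None, Some(3)),
      (Some(5), None)
    ).toDF("dividend", "divisor")
      .withColumn("pmod", pmod(col("dividend"), col("divisor")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")        // assumption: local run for illustration
      .appName("pmod-nulls")
      .getOrCreate()
    sample(spark).show(false)
    spark.stop()
  }
}
```

The first row yields 2, and the other two rows yield null in the pmod column. If downstream code needs a concrete bucket for such rows, wrapping the result in coalesce with a default value is one option.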