Spark Scala datediff and date_diff

datediff returns the number of days between two date columns. It's the go-to function for computing things like "days to ship", "days since signup", or "age in days". Spark 3.4 added a SQL-only alias date_diff that does the same thing — useful when you're writing a SQL expression but otherwise interchangeable.

datediff

def datediff(end: Column, start: Column): Column

datediff returns end - start in whole days. The order matters: the end date comes first, the start date comes second. Get them backwards and you'll see negative numbers everywhere.

The most common shape is two date columns from the same row — for example, an ordered_on and a shipped_on column that you want to turn into a days_to_ship metric:

val df = Seq(
  ("order_a", "2026-01-15", "2026-01-22"),
  ("order_b", "2026-02-01", "2026-02-04"),
  ("order_c", "2025-12-20", "2026-01-05"),
  ("order_d", "2025-07-04", "2025-08-15"),
).toDF("order_id", "ordered_on", "shipped_on")

val df2 = df
  .withColumn("days_to_ship", datediff(col("shipped_on"), col("ordered_on")))

df2.show(false)
// +--------+----------+----------+------------+
// |order_id|ordered_on|shipped_on|days_to_ship|
// +--------+----------+----------+------------+
// |order_a |2026-01-15|2026-01-22|7           |
// |order_b |2026-02-01|2026-02-04|3           |
// |order_c |2025-12-20|2026-01-05|16          |
// |order_d |2025-07-04|2025-08-15|42          |
// +--------+----------+----------+------------+

The input columns can be strings in yyyy-MM-dd format (Spark casts them implicitly), DateType, or TimestampType. With timestamps, only the date portion is used — the time-of-day is ignored, so 2026-01-15 23:59:59 and 2026-01-15 00:00:01 are treated as the same day. The return type is always an integer.

Notice that datediff correctly handles month and year boundaries — order_c crosses from December into January and the count is 16 days, not some other off-by-one number.
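To see the time-of-day truncation in action, here's a quick check with two timestamps only two seconds apart in real time (assuming the usual spark-shell setup with implicits and functions imported; column names are illustrative):

```scala
val ts = Seq(
  ("2026-01-15 23:59:59", "2026-01-16 00:00:01")
).toDF("late_night", "just_after_midnight")

// The instants are 2 seconds apart, but the calendar dates differ,
// so datediff reports a full day: the time portions are discarded.
val tsDiff = ts
  .withColumn("day_gap",
    datediff(to_timestamp(col("just_after_midnight")), to_timestamp(col("late_night"))))
```

Here day_gap is 1, even though barely any time elapsed. Conversely, two timestamps 23 hours apart on the same calendar date would give 0.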

Days from a fixed reference date

When you want to compute days against a single anchor — like "today" or a deadline — wrap the literal date in lit() and pass it as the end argument:

val df = Seq(
  ("alice", "2025-12-15"),
  ("bob",   "2026-01-01"),
  ("carol", "2026-02-20"),
  ("dave",  "2026-04-30"),
).toDF("user_id", "signed_up_on")

val df2 = df
  .withColumn(
    "days_since_signup",
    datediff(lit("2026-05-04"), col("signed_up_on")),
  )

df2.show(false)
// +-------+------------+-----------------+
// |user_id|signed_up_on|days_since_signup|
// +-------+------------+-----------------+
// |alice  |2025-12-15  |140              |
// |bob    |2026-01-01  |123              |
// |carol  |2026-02-20  |73               |
// |dave   |2026-04-30  |4                |
// +-------+------------+-----------------+

For "days since signup as of right now", swap lit("2026-05-04") for current_date() and Spark will use the current date at query time.
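That swap looks like this (the values now depend on when you run it, so no expected output is shown):

```scala
// current_date() is evaluated at query time, so this stays correct
// without hard-coding an anchor date.
val df3 = df
  .withColumn("days_since_signup", datediff(current_date(), col("signed_up_on")))
```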

Argument order matters

If you flip the arguments, you get the same magnitude with the opposite sign. There's no "absolute value" version built in — pass the dates in the order that produces the sign you want, or wrap the result in abs() if you don't care about direction:

val df = Seq(
  ("2026-01-15", "2026-01-22"),
  ("2026-01-22", "2026-01-15"),
).toDF("a", "b")

val df2 = df
  .withColumn("b_minus_a", datediff(col("b"), col("a")))
  .withColumn("a_minus_b", datediff(col("a"), col("b")))

df2.show(false)
// +----------+----------+---------+---------+
// |a         |b         |b_minus_a|a_minus_b|
// +----------+----------+---------+---------+
// |2026-01-15|2026-01-22|7        |-7       |
// |2026-01-22|2026-01-15|-7       |7        |
// +----------+----------+---------+---------+

A useful mnemonic: read datediff(end, start) as "end minus start". The function name puts diff second, but the arguments are in the same order you'd write the subtraction.
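When only the size of the gap matters, the abs() wrapper looks like this (reusing the a/b DataFrame above):

```scala
// abs() discards the sign, so the argument order no longer matters.
val df3 = df
  .withColumn("gap_days", abs(datediff(col("b"), col("a"))))
```

Both rows come out as 7.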

date_diff: the SQL-only Spark 3.4 alias

date_diff was added in Spark 3.4.0 as an alias for datediff. It isn't exposed in the Scala functions API, so you call it through expr():

date_diff(endDate, startDate)

The Spark SQL documentation defines it as "Returns the number of days from startDate to endDate" — the same end-first, start-second argument order as datediff.

It returns the same result as datediff for the same inputs:

val df = Seq(
  ("2026-01-15", "2026-01-22"),
  ("2025-12-20", "2026-01-05"),
  ("2025-07-04", "2025-08-15"),
).toDF("ordered_on", "shipped_on")

val df2 = df
  .withColumn("datediff",  datediff(col("shipped_on"), col("ordered_on")))
  .withColumn("date_diff", expr("date_diff(shipped_on, ordered_on)"))

df2.show(false)
// +----------+----------+--------+---------+
// |ordered_on|shipped_on|datediff|date_diff|
// +----------+----------+--------+---------+
// |2026-01-15|2026-01-22|7       |7        |
// |2025-12-20|2026-01-05|16      |16       |
// |2025-07-04|2025-08-15|42      |42       |
// +----------+----------+--------+---------+

For Scala code, prefer datediff — it's the native Column function and doesn't require expr(). date_diff is mainly useful when you're writing the expression as a SQL string (e.g., inside selectExpr or a SQL view definition) and want the snake_case spelling.
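As a sketch of that SQL-string use case, here's the shipping example rewritten with selectExpr (requires Spark 3.4+, and reuses the ordered_on/shipped_on DataFrame above):

```scala
// The whole projection is SQL, so the snake_case date_diff fits naturally.
val df3 = df.selectExpr(
  "ordered_on",
  "shipped_on",
  "date_diff(shipped_on, ordered_on) AS days_to_ship"
)
```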

If you're targeting Spark 3.0 through 3.3, date_diff doesn't exist yet — use datediff.

Null handling

When either input is null, the result is null. There's no special behavior here — null in, null out — but it's worth seeing on real data so you know what to expect when your input has gaps:

val df = Seq(
  (Some("2026-01-15"), Some("2026-01-22")),
  (None,               Some("2026-01-22")),
  (Some("2026-01-15"), None),
  (None,               None),
).toDF("ordered_on", "shipped_on")

val df2 = df
  .withColumn("days_to_ship", datediff(col("shipped_on"), col("ordered_on")))

df2.show(false)
// +----------+----------+------------+
// |ordered_on|shipped_on|days_to_ship|
// +----------+----------+------------+
// |2026-01-15|2026-01-22|7           |
// |null      |2026-01-22|null        |
// |2026-01-15|null      |null        |
// |null      |null      |null        |
// +----------+----------+------------+

If you'd rather treat nulls as a fallback date (say, "still in transit means today"), use coalesce on the input column before passing it to datediff.
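A minimal sketch of that fallback, assuming a null shipped_on means "still in transit as of today" (reusing the DataFrame above; the hypothetical column name days_to_ship_or_age is illustrative):

```scala
// coalesce substitutes today's date when shipped_on is null.
// Rows where ordered_on itself is null still produce null.
val df3 = df
  .withColumn("days_to_ship_or_age",
    datediff(coalesce(col("shipped_on"), current_date()), col("ordered_on")))
```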

To shift a date forward or backward by a fixed number of days or months instead of measuring the gap, see date_add, date_sub, and add_months. For the gap measured in months instead of days, see months_between. To get the current date for "as of today" comparisons, see current_date and current_timestamp.

Example Details

Created: 2026-05-04 10:22:56 PM

Last Updated: 2026-05-04 10:22:56 PM