
Spark Scala to_date and to_timestamp

to_date and to_timestamp parse string columns into proper date and timestamp types. They're the functions you reach for whenever raw data lands as text — CSV files, JSON payloads, or upstream queries — and you need to do anything date-related with it. Spark 3.4 also adds to_timestamp_ntz and to_timestamp_ltz for explicit time-zone semantics.

Parsing dates with to_date

The to_date function converts a string column to a date column. It has two signatures: one that assumes the standard ISO format (yyyy-MM-dd), and one that accepts a custom format pattern.

def to_date(e: Column): Column

def to_date(e: Column, fmt: String): Column

When the input is already in yyyy-MM-dd format, you don't need to pass a format string:

val df = Seq(
  "2026-01-15",
  "2025-12-20",
  "2025-07-04",
  "2024-02-29",
).toDF("event_date_str")

val df2 = df
  .withColumn("event_date", to_date(col("event_date_str")))

df2.show(false)
// +--------------+----------+
// |event_date_str|event_date|
// +--------------+----------+
// |2026-01-15    |2026-01-15|
// |2025-12-20    |2025-12-20|
// |2025-07-04    |2025-07-04|
// |2024-02-29    |2024-02-29|
// +--------------+----------+

df2.printSchema()
// root
//  |-- event_date_str: string (nullable = true)
//  |-- event_date: date (nullable = true)

The output looks identical to the input string, but the schema confirms the column is now a date rather than a string. That distinction matters — once the column is a real date, it works with date arithmetic, datediff, date_format, and every other date/time function.
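As a minimal, self-contained sketch of that claim (assuming a local SparkSession; the column names are the same hypothetical ones used above), the parsed column feeds straight into datediff, date_add, and date_format:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("to_date-arith").getOrCreate()
import spark.implicits._

val df2 = Seq("2026-01-15", "2025-12-20")
  .toDF("event_date_str")
  .withColumn("event_date", to_date(col("event_date_str")))

// Once event_date is a real DateType column, date functions apply directly.
val out = df2
  .withColumn("next_week", date_add(col("event_date"), 7))
  .withColumn("days_until_2027", datediff(lit("2027-01-01").cast("date"), col("event_date")))
  .withColumn("pretty", date_format(col("event_date"), "MMM d, yyyy"))

out.show(false)
```

None of this works on the raw string column; Spark's date functions require a real date or timestamp type.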

Parsing non-ISO date formats

Real-world data rarely arrives in ISO format. Pass a datetime format pattern (Spark 3 uses its own pattern syntax, modeled on Java's DateTimeFormatter) as the second argument to handle other formats:

val df = Seq(
  "01/15/2026",
  "12/20/2025",
  "07/04/2025",
  "02/29/2024",
).toDF("event_date_str")

val df2 = df
  .withColumn("event_date", to_date(col("event_date_str"), "MM/dd/yyyy"))

df2.show(false)
// +--------------+----------+
// |event_date_str|event_date|
// +--------------+----------+
// |01/15/2026    |2026-01-15|
// |12/20/2025    |2025-12-20|
// |07/04/2025    |2025-07-04|
// |02/29/2024    |2024-02-29|
// +--------------+----------+

The format pattern uses the same tokens as date_format: yyyy for the four-digit year, MM for the zero-padded month, dd for the zero-padded day. See the date_format examples for a full tour of the pattern syntax.
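A common real-world wrinkle worth sketching (hypothetical sample data; this relies on the default non-ANSI settings, where a non-matching pattern yields null rather than an error): when a single column mixes several formats, try each pattern in turn and coalesce the results.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("multi-format").getOrCreate()
import spark.implicits._

// Three spellings of the same date, mixed in one column.
val mixed = Seq("01/15/2026", "2026-01-15", "20260115").toDF("raw")

// Each to_date attempt returns null when its pattern doesn't match,
// so coalesce falls through to the first successful parse.
val parsed = mixed.withColumn(
  "event_date",
  coalesce(
    to_date(col("raw"), "MM/dd/yyyy"),
    to_date(col("raw"), "yyyy-MM-dd"),
    to_date(col("raw"), "yyyyMMdd")
  )
)

parsed.show(false)
```

Order the patterns from most to least specific if any of them could ambiguously match the same string.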

Parsing timestamps with to_timestamp

to_timestamp works the same way as to_date, but produces a timestamp column that includes time-of-day information.

def to_timestamp(s: Column): Column

def to_timestamp(s: Column, fmt: String): Column

With no pattern argument, to_timestamp expects the standard yyyy-MM-dd HH:mm:ss shape (the default parser also tolerates fractional seconds and zone offsets):

val df = Seq(
  "2026-01-15 09:30:45",
  "2025-12-20 23:59:59",
  "2025-07-04 00:00:01",
  "2024-02-29 12:00:00",
).toDF("event_ts_str")

val df2 = df
  .withColumn("event_ts", to_timestamp(col("event_ts_str")))

df2.show(false)
// +-------------------+-------------------+
// |event_ts_str       |event_ts           |
// +-------------------+-------------------+
// |2026-01-15 09:30:45|2026-01-15 09:30:45|
// |2025-12-20 23:59:59|2025-12-20 23:59:59|
// |2025-07-04 00:00:01|2025-07-04 00:00:01|
// |2024-02-29 12:00:00|2024-02-29 12:00:00|
// +-------------------+-------------------+

df2.printSchema()
// root
//  |-- event_ts_str: string (nullable = true)
//  |-- event_ts: timestamp (nullable = true)

For timestamps in any other shape, supply a format pattern:

val df = Seq(
  "Jan 15, 2026 9:30 AM",
  "Dec 20, 2025 11:59 PM",
  "Jul 4, 2025 12:01 AM",
  "Feb 29, 2024 12:00 PM",
).toDF("event_ts_str")

val df2 = df
  .withColumn("event_ts", to_timestamp(col("event_ts_str"), "MMM d, yyyy h:mm a"))

df2.show(false)
// +---------------------+-------------------+
// |event_ts_str         |event_ts           |
// +---------------------+-------------------+
// |Jan 15, 2026 9:30 AM |2026-01-15 09:30:00|
// |Dec 20, 2025 11:59 PM|2025-12-20 23:59:00|
// |Jul 4, 2025 12:01 AM |2025-07-04 00:01:00|
// |Feb 29, 2024 12:00 PM|2024-02-29 12:00:00|
// +---------------------+-------------------+

MMM matches the abbreviated month name, h is the 12-hour clock, and a matches AM/PM.
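One more pattern token worth a quick sketch (hypothetical data, local SparkSession assumed): SSS matches fractional seconds, which raw event logs often carry.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("fractional-seconds").getOrCreate()
import spark.implicits._

val df = Seq("2026-01-15 09:30:45.123").toDF("s")

// SSS captures millisecond precision explicitly.
val withTs = df.select(to_timestamp(col("s"), "yyyy-MM-dd HH:mm:ss.SSS").as("ts"))

withTs.show(false)
```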

to_timestamp_ntz and to_timestamp_ltz

Spark 3.4 adds two SQL-only variants that parse into specific timestamp types:

  • to_timestamp_ntz parses into TIMESTAMP_NTZ — a timestamp without a time zone, treated as a wall-clock value.
  • to_timestamp_ltz parses into TIMESTAMP_LTZ — a timestamp with the session local time zone, the same type returned by to_timestamp.

Neither has a Scala entry in org.apache.spark.sql.functions as of 3.4, so you call them through expr():

to_timestamp_ntz(timestamp_str[, fmt]) — via expr()

to_timestamp_ltz(timestamp_str[, fmt]) — via expr()

val df = Seq(
  "2026-01-15 09:30:45",
  "2025-12-20 23:59:59",
  "2025-07-04 00:00:01",
).toDF("event_ts_str")

val df2 = df
  .withColumn("ts_ntz", expr("to_timestamp_ntz(event_ts_str)"))
  .withColumn("ts_ltz", expr("to_timestamp_ltz(event_ts_str)"))

df2.show(false)
// +-------------------+-------------------+-------------------+
// |event_ts_str       |ts_ntz             |ts_ltz             |
// +-------------------+-------------------+-------------------+
// |2026-01-15 09:30:45|2026-01-15 09:30:45|2026-01-15 09:30:45|
// |2025-12-20 23:59:59|2025-12-20 23:59:59|2025-12-20 23:59:59|
// |2025-07-04 00:00:01|2025-07-04 00:00:01|2025-07-04 00:00:01|
// +-------------------+-------------------+-------------------+

df2.printSchema()
// root
//  |-- event_ts_str: string (nullable = true)
//  |-- ts_ntz: timestamp_ntz (nullable = true)
//  |-- ts_ltz: timestamp (nullable = true)

The values are identical for these inputs because no time-zone offset is supplied — the difference is in the schema. ts_ntz is timestamp_ntz and is never adjusted for time zones. ts_ltz is the regular timestamp type, which adjusts to the session time zone when the input includes an offset (e.g. 2026-01-15 09:30:45+05:00).

Reach for to_timestamp_ntz when you want to record a wall-clock time that means the same thing everywhere — appointment times, scheduled events, log timestamps already normalized to UTC. Reach for to_timestamp_ltz (or just to_timestamp) when the timestamp represents an absolute instant that should shift with the viewer's time zone.
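The distinction can be sketched with an input that carries an explicit offset (hypothetical data; this assumes a local SparkSession and sets the session time zone to America/New_York purely for demonstration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("ltz-offset").getOrCreate()
import spark.implicits._

// Render all timestamps in the Eastern-time session zone.
spark.conf.set("spark.sql.session.timeZone", "America/New_York")

// The input names an absolute instant: 09:30:45 UTC.
val df = Seq("2026-01-15 09:30:45+00:00").toDF("event_ts_str")

val withLtz = df.withColumn("ts_ltz", expr("to_timestamp_ltz(event_ts_str)"))

// ts_ltz shifts to the session zone, five hours behind UTC in January.
withLtz.show(false)
```

The same instant would display differently under a different session time zone; that adjustment is exactly what TIMESTAMP_NTZ opts out of.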

Handling invalid input

Strings that don't match the expected format produce null rather than throwing an exception. The same goes for null inputs:

val df = Seq(
  "2026-01-15",
  "not-a-date",
  null,
  "2026-13-99",
).toDF("event_date_str")

val df2 = df
  .withColumn("event_date", to_date(col("event_date_str")))
  .withColumn("event_ts",   to_timestamp(col("event_date_str")))

df2.show(false)
// +--------------+----------+-------------------+
// |event_date_str|event_date|event_ts           |
// +--------------+----------+-------------------+
// |2026-01-15    |2026-01-15|2026-01-15 00:00:00|
// |not-a-date    |null      |null               |
// |null          |null      |null               |
// |2026-13-99    |null      |null               |
// +--------------+----------+-------------------+

A few things to note: to_timestamp happily accepts a date-only string and zero-pads the time. Garbage strings and impossible dates (month 13, day 99) both yield null instead of an error, so you'll want to filter those out (or count them) downstream rather than trusting the result blindly.
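That downstream filtering is a one-liner; here is a minimal sketch (same sample data as above, local SparkSession assumed) that keeps rows where the raw string is non-null but the parse came back null:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("bad-rows").getOrCreate()
import spark.implicits._

val df2 = Seq("2026-01-15", "not-a-date", null, "2026-13-99")
  .toDF("event_date_str")
  .withColumn("event_date", to_date(col("event_date_str")))

// A null event_date with a non-null source string means the parse failed.
val bad = df2.filter(col("event_date_str").isNotNull && col("event_date").isNull)

println(s"unparseable rows: ${bad.count()}")
bad.show(false)
```

Counting (or quarantining) these rows is cheap insurance against a format change upstream silently nulling out an entire column.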

To go the other direction — turning a date or timestamp back into a formatted string — use date_format. To extract individual parts (year, month, day, hour, etc.) from the parsed value, see year, month, and day and hour, minute, and second.

Example Details

Created: 2026-05-09 10:18:24 PM

Last Updated: 2026-05-09 10:18:24 PM