Spark Scala Advanced Regex Functions

Beyond the familiar regexp_replace and regexp_extract, Spark 3.4 ships a set of regex helpers — regexp_count, regexp_extract_all, regexp_instr, regexp_like, and regexp_substr — that cover the common "count / extract-all / locate / test / grab-first" jobs that otherwise require chaining helpers or writing UDFs. They're SQL-only, so you reach them through expr() in Spark Scala.

All five were introduced in Spark 3.4, so an upgrade to at least 3.4.0 is required.

regexp_count

regexp_count(str, regexp) — Returns the number of non-overlapping matches of regexp in str

regexp_count returns an integer count of how many times a pattern matches within a string. Handy when you want to flag rows with more than N matches, or just know how dense your matches are.

val df = Seq(
  "Login from 10.0.0.1 and 10.0.0.2",
  "Login from 192.168.1.5",
  "No addresses here",
  "Chain: 10.1.1.1 -> 10.1.1.2 -> 10.1.1.3",
).toDF("log_line")

val df2 = df
  .withColumn("ip_count", expr("regexp_count(log_line, '\\d+\\.\\d+\\.\\d+\\.\\d+')"))

df2.show(false)
// +---------------------------------------+--------+
// |log_line                               |ip_count|
// +---------------------------------------+--------+
// |Login from 10.0.0.1 and 10.0.0.2       |2       |
// |Login from 192.168.1.5                 |1       |
// |No addresses here                      |0       |
// |Chain: 10.1.1.1 -> 10.1.1.2 -> 10.1.1.3|3       |
// +---------------------------------------+--------+

A string with no matches returns 0, not null. That makes it safe to use directly in boolean filters (> 0) without worrying about null-safe comparisons.

regexp_extract_all

regexp_extract_all(str, regexp[, idx]) — Returns an array of all matches of regexp in str

regexp_extract only gives you one match per row. regexp_extract_all gives you every match as an array — which is usually what you want when scraping structured tokens out of free-form text. The optional idx argument selects which capture group to return (defaults to 1); pass 0 to return the whole match.

val df = Seq(
  "Contact alice@example.com or bob@example.org for details.",
  "Email: carol@example.net",
  "No email in this message.",
).toDF("message")

val df2 = df
  .withColumn("emails", expr("regexp_extract_all(message, '[A-Za-z0-9._]+@[A-Za-z0-9.]+', 0)"))

df2.show(false)
// +---------------------------------------------------------+------------------------------------+
// |message                                                  |emails                              |
// +---------------------------------------------------------+------------------------------------+
// |Contact alice@example.com or bob@example.org for details.|[alice@example.com, bob@example.org]|
// |Email: carol@example.net                                 |[carol@example.net]                 |
// |No email in this message.                                |[]                                  |
// +---------------------------------------------------------+------------------------------------+

Rows with no matches return an empty array rather than null. Combine with explode to turn each match into its own row, or with size to get a match count (similar to regexp_count but with the matched values preserved).

regexp_instr

regexp_instr(str, regexp[, idx]) — Returns the 1-based position of the first match of regexp in str, or 0 if no match

regexp_instr is the regex cousin of instr — it returns the 1-based position of the first character of the match. If there is no match it returns 0, not null.

val df = Seq(
  "Order ABC-12345 shipped",
  "Ref: 987 pending",
  "No digits in this string",
).toDF("note")

val df2 = df
  .withColumn("first_digit_pos", expr("regexp_instr(note, '\\d+')"))

df2.show(false)
// +------------------------+---------------+
// |note                    |first_digit_pos|
// +------------------------+---------------+
// |Order ABC-12345 shipped |11             |
// |Ref: 987 pending        |6              |
// |No digits in this string|0              |
// +------------------------+---------------+

For locating a plain substring — no regex needed — reach for instr or locate instead; they're simpler and don't require escaping regex metacharacters.

regexp_like

regexp_like(str, regexp) — Returns true if str matches regexp, false otherwise

regexp_like is the boolean test — does this string match the pattern? It's the SQL-standard equivalent of Spark's rlike Column method, and either works. Use it when you want a boolean column (rather than a filter) to carry the match result forward.

val df = Seq(
  ("Alice", "alice@example.com"),
  ("Bob",   "not-an-email"),
  ("Carol", "carol@example.org"),
  ("Dave",  "dave@@example.com"),
).toDF("name", "contact")

val df2 = df
  .withColumn("is_email", expr("regexp_like(contact, '^[A-Za-z0-9._]+@[A-Za-z0-9.]+$')"))

df2.show(false)
// +-----+-----------------+--------+
// |name |contact          |is_email|
// +-----+-----------------+--------+
// |Alice|alice@example.com|true    |
// |Bob  |not-an-email     |false   |
// |Carol|carol@example.org|true    |
// |Dave |dave@@example.com|false   |
// +-----+-----------------+--------+

If you only need to filter rows (not carry a boolean column through), the rlike method on Column is usually more idiomatic: df.filter(col("contact").rlike("^[A-Za-z0-9._]+@[A-Za-z0-9.]+$")). See rlike in the like / ilike article for more.

regexp_substr

regexp_substr(str, regexp) — Returns the first substring of str that matches regexp, or null if no match

regexp_substr returns the first matched substring — conceptually the same as regexp_extract_all(..., 0)[0], but without the array wrapping.

val df = Seq(
  "Call me at 555-123-4567 tomorrow",
  "Old number: 555-000-9999, new: 555-111-2222",
  "No phone in this line",
).toDF("message")

val df2 = df
  .withColumn("phone", expr("regexp_substr(message, '\\d{3}-\\d{3}-\\d{4}')"))

df2.show(false)
// +-------------------------------------------+------------+
// |message                                    |phone       |
// +-------------------------------------------+------------+
// |Call me at 555-123-4567 tomorrow           |555-123-4567|
// |Old number: 555-000-9999, new: 555-111-2222|555-000-9999|
// |No phone in this line                      |null        |
// +-------------------------------------------+------------+

Unlike regexp_count, regexp_instr, and regexp_extract_all, a no-match result is null here rather than a zero or empty array — treat it accordingly in downstream logic.