Spark Scala Advanced Regex Functions
Beyond the familiar regexp_replace and regexp_extract, Spark 3.4 ships a set of regex helpers — regexp_count, regexp_extract_all, regexp_instr, regexp_like, and regexp_substr — that cover the common "count / extract-all / locate / test / grab-first" jobs that otherwise require chaining helpers or writing UDFs. They're SQL-only, so you reach them through expr() in Spark Scala.
All five were introduced in Spark 3.4, so an upgrade to at least 3.4.0 is required.
regexp_count
regexp_count(str, regexp) — Returns the number of non-overlapping matches of regexp in str
regexp_count returns an integer count of how many times a pattern matches within a string. Handy when you want to flag rows with more than N matches, or just know how dense your matches are.
val df = Seq(
"Login from 10.0.0.1 and 10.0.0.2",
"Login from 192.168.1.5",
"No addresses here",
"Chain: 10.1.1.1 -> 10.1.1.2 -> 10.1.1.3",
).toDF("log_line")
val df2 = df
.withColumn("ip_count", expr("regexp_count(log_line, '\\d+\\.\\d+\\.\\d+\\.\\d+')"))
df2.show(false)
// +---------------------------------------+--------+
// |log_line |ip_count|
// +---------------------------------------+--------+
// |Login from 10.0.0.1 and 10.0.0.2 |2 |
// |Login from 192.168.1.5 |1 |
// |No addresses here |0 |
// |Chain: 10.1.1.1 -> 10.1.1.2 -> 10.1.1.3|3 |
// +---------------------------------------+--------+
A string with no matches returns 0, not null. That makes it safe to use directly in boolean filters (> 0) without worrying about null-safe comparisons.
regexp_extract_all
regexp_extract_all(str, regexp[, idx]) — Returns an array of all matches of regexp in str
regexp_extract only gives you one match per row. regexp_extract_all gives you every match as an array — which is usually what you want when scraping structured tokens out of free-form text. The optional idx argument selects which capture group to return (defaults to 1); pass 0 to return the whole match.
val df = Seq(
"Contact alice@example.com or bob@example.org for details.",
"Email: carol@example.net",
"No email in this message.",
).toDF("message")
val df2 = df
.withColumn("emails", expr("regexp_extract_all(message, '[A-Za-z0-9._]+@[A-Za-z0-9.]+', 0)"))
df2.show(false)
// +---------------------------------------------------------+------------------------------------+
// |message |emails |
// +---------------------------------------------------------+------------------------------------+
// |Contact alice@example.com or bob@example.org for details.|[alice@example.com, bob@example.org]|
// |Email: carol@example.net |[carol@example.net] |
// |No email in this message. |[] |
// +---------------------------------------------------------+------------------------------------+
Rows with no matches return an empty array rather than null. Combine with explode to turn each match into its own row, or with size to get a match count (similar to regexp_count but with the matched values preserved).
regexp_instr
regexp_instr(str, regexp[, idx]) — Returns the 1-based position of the first match of regexp in str, or 0 if no match
regexp_instr is the regex cousin of instr — it returns the 1-based position of the first character of the match. If there is no match it returns 0, not null.
val df = Seq(
"Order ABC-12345 shipped",
"Ref: 987 pending",
"No digits in this string",
).toDF("note")
val df2 = df
.withColumn("first_digit_pos", expr("regexp_instr(note, '\\d+')"))
df2.show(false)
// +------------------------+---------------+
// |note |first_digit_pos|
// +------------------------+---------------+
// |Order ABC-12345 shipped |11 |
// |Ref: 987 pending |6 |
// |No digits in this string|0 |
// +------------------------+---------------+
For locating a plain substring — no regex needed — reach for instr or locate instead; they're simpler and don't require escaping regex metacharacters.
regexp_like
regexp_like(str, regexp) — Returns true if str matches regexp, false otherwise
regexp_like is the boolean test — does this string match the pattern? It's the SQL-standard equivalent of Spark's rlike Column method, and either works. Use it when you want a boolean column (rather than a filter) to carry the match result forward.
val df = Seq(
("Alice", "alice@example.com"),
("Bob", "not-an-email"),
("Carol", "carol@example.org"),
("Dave", "dave@@example.com"),
).toDF("name", "contact")
val df2 = df
.withColumn("is_email", expr("regexp_like(contact, '^[A-Za-z0-9._]+@[A-Za-z0-9.]+$')"))
df2.show(false)
// +-----+-----------------+--------+
// |name |contact |is_email|
// +-----+-----------------+--------+
// |Alice|alice@example.com|true |
// |Bob |not-an-email |false |
// |Carol|carol@example.org|true |
// |Dave |dave@@example.com|false |
// +-----+-----------------+--------+
If you only need to filter rows (not carry a boolean column through), the rlike method on Column is usually more idiomatic: df.filter(col("contact").rlike("^[A-Za-z0-9._]+@[A-Za-z0-9.]+$")). See rlike in the like / ilike article for more.
regexp_substr
regexp_substr(str, regexp) — Returns the first substring of str that matches regexp, or null if no match
regexp_substr returns the first matched substring — conceptually the same as regexp_extract_all(..., 0)[0], but without the array wrapping.
val df = Seq(
"Call me at 555-123-4567 tomorrow",
"Old number: 555-000-9999, new: 555-111-2222",
"No phone in this line",
).toDF("message")
val df2 = df
.withColumn("phone", expr("regexp_substr(message, '\\d{3}-\\d{3}-\\d{4}')"))
df2.show(false)
// +-------------------------------------------+------------+
// |message |phone |
// +-------------------------------------------+------------+
// |Call me at 555-123-4567 tomorrow |555-123-4567|
// |Old number: 555-000-9999, new: 555-111-2222|555-000-9999|
// |No phone in this line |null |
// +-------------------------------------------+------------+
Unlike regexp_count, regexp_instr, and regexp_extract_all, a no-match result is null here rather than a zero or empty array — treat it accordingly in downstream logic.
Related functions
For replacing matches, see regexp_replace. For extracting a single capture group, see regexp_extract in the same article. For simpler substring searches without regex, see instr and locate. For SQL-style pattern matching with LIKE wildcards, see like, ilike, and rlike.