Spark Scala Replace
replace substitutes all occurrences of a substring within a string column. It's the straightforward choice when you need a literal find-and-replace without regular expressions.
replace is a Spark SQL function. Before Spark 3.5 it isn't exposed in the org.apache.spark.sql.functions object, so you call it through expr(), which also works on later versions:
replace(str, search[, replacement]): Column (SQL function, called via expr())
str is the source string column, search is the literal substring to find, and replacement is what to put in its place. replacement is optional; when it is omitted, matches are simply removed. Every occurrence of search is replaced, not just the first one, and the match is case-sensitive.
Here's a basic example replacing a domain name in email addresses:
// Assumes a SparkSession named `spark` is in scope, as in spark-shell.
import org.apache.spark.sql.functions.expr
import spark.implicits._ // enables .toDF on local collections

val df = Seq(
  ("Alice", "alice@oldcompany.com"),
  ("Bob", "bob@oldcompany.com"),
  ("Carol", "carol@oldcompany.com"),
  ("Dave", "dave@oldcompany.com")
).toDF("name", "email")

// Swap the old domain for the new one in every row.
val df2 = df
  .withColumn("new_email", expr("replace(email, 'oldcompany.com', 'newcompany.com')"))

df2.show(false)
// +-----+--------------------+--------------------+
// |name |email               |new_email           |
// +-----+--------------------+--------------------+
// |Alice|alice@oldcompany.com|alice@newcompany.com|
// |Bob  |bob@oldcompany.com  |bob@newcompany.com  |
// |Carol|carol@oldcompany.com|carol@newcompany.com|
// |Dave |dave@oldcompany.com |dave@newcompany.com |
// +-----+--------------------+--------------------+
Inside the expr() string, column names are unquoted and literal strings use single quotes.
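The all-occurrences and case-sensitive behavior described earlier can be checked with a small sketch. The sample data, the column names, and the local SparkSession setup are assumptions for illustration; in spark-shell the session already exists. Lower-casing the input first is just one possible way to get a case-insensitive match, not something replace offers by itself.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("replace-case-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data mixing lower-case, capitalized, and upper-case matches.
val colors = Seq("red Red red", "RED", "blue").toDF("text")

val swapped = colors
  // Only the exact lower-case "red" matches; both such occurrences in the first row change.
  .withColumn("swapped", expr("replace(text, 'red', 'green')"))
  // Lower-casing the input first makes the match case-insensitive,
  // at the cost of lower-casing the whole result.
  .withColumn("swapped_ci", expr("replace(lower(text), 'red', 'green')"))

swapped.show(false)
```

In this sketch the first row becomes "green Red green" in swapped and "green green green" in swapped_ci, while "RED" is left untouched by the plain call.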
Removing characters with replace
Set the replacement to an empty string to delete all occurrences of the search substring. This is useful for stripping formatting characters like dashes, parentheses, or spaces:
val df = Seq(
  "2024-01-15",
  "2024-06-30",
  "2024-12-25"
).toDF("date_str")

val df2 = df
  .withColumn("cleaned", expr("replace(date_str, '-', '')"))

df2.show(false)
// +----------+--------+
// |date_str  |cleaned |
// +----------+--------+
// |2024-01-15|20240115|
// |2024-06-30|20240630|
// |2024-12-25|20241225|
// +----------+--------+
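A single replace call removes one substring, so stripping several different characters means nesting calls. The following is a minimal sketch of that idea, assuming hypothetical phone-number data and a local SparkSession; the foldLeft only builds the nested SQL string and is a convenience, not a Spark feature.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("replace-strip-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val phones = Seq("(555) 123-4567", "555-987-6543").toDF("phone")

// Characters to delete; none of them needs escaping inside a single-quoted SQL literal.
val toStrip = Seq("(", ")", "-", " ")

// Builds: replace(replace(replace(replace(phone, '(', ''), ')', ''), '-', ''), ' ', '')
val stripExpr = toStrip.foldLeft("phone")((acc, ch) => s"replace($acc, '$ch', '')")

val digitsOnly = phones.withColumn("digits", expr(stripExpr))
digitsOnly.show(false)
```

Both rows end up as ten-digit strings ("5551234567" and "5559876543"). When every item to remove is a single character, translate may be a more compact choice.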
Handling nulls and empty strings
When the source column is null, replace returns null. An empty string passes through unchanged — no error is thrown:
val df = Seq(
  ("Alice", null),
  ("Bob", "bob@example.com"),
  ("Carol", ""),
  ("Dave", "dave@example.com")
).toDF("name", "email")

val df2 = df
  .withColumn("replaced", expr("replace(email, 'example.com', 'work.com')"))

df2.show(false)
// +-----+----------------+-------------+
// |name |email           |replaced     |
// +-----+----------------+-------------+
// |Alice|null            |null         |
// |Bob  |bob@example.com |bob@work.com |
// |Carol|                |             |
// |Dave |dave@example.com|dave@work.com|
// +-----+----------------+-------------+
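If a null result is not what you want downstream, one option is to wrap the call in coalesce and supply a fallback. This is a sketch under assumptions: the 'unknown' placeholder, the sample rows, and the local SparkSession are all made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("replace-null-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data with one null and one empty email.
val contacts = Seq[(String, String)](
  ("Alice", null),
  ("Bob", "bob@example.com"),
  ("Carol", "")
).toDF("name", "email")

// coalesce returns the first non-null argument, so the fallback is used
// only when replace itself returned null.
val withFallback = contacts.withColumn(
  "replaced",
  expr("coalesce(replace(email, 'example.com', 'work.com'), 'unknown')")
)

withFallback.show(false)
```

Alice's null becomes "unknown", while Carol's empty string stays empty because an empty string is not null.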
replace vs regexp_replace
replace does a literal substring match. regexp_replace takes a Java regular expression, so it can match patterns like \d{4} or [A-Z]+. If your search string is a fixed literal, replace is simpler and avoids the need to escape regex metacharacters.
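The difference is easiest to see with a search string that is also a regex metacharacter. The sketch below, which assumes a hypothetical version column and a local SparkSession, replaces the dot in three ways.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, regexp_replace}

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("replace-vs-regex-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val versions = Seq("1.2.3").toDF("version")

val compared = versions
  // Literal match: only the dots are replaced.
  .withColumn("literal", expr("replace(version, '.', '-')"))
  // Unescaped regex: "." matches any character, so every character is replaced.
  .withColumn("regex_unescaped", regexp_replace(col("version"), ".", "-"))
  // Escaped regex: "\\." matches a literal dot, giving the same result as replace.
  .withColumn("regex_escaped", regexp_replace(col("version"), "\\.", "-"))

compared.show(false)
```

Here literal and regex_escaped both yield "1-2-3", while regex_unescaped yields "-----".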
Related functions
For pattern-based substitution using regular expressions, see regexp_replace. For character-by-character substitution (mapping individual characters to replacements), see translate. For replacing characters at a specific position in a string, see overlay.
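To make the contrast with translate and overlay concrete, here is a brief sketch using their counterparts in org.apache.spark.sql.functions. The sample word, the column names, and the local SparkSession are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, overlay, translate}

// Assumption: a local session for experimentation.
val spark = SparkSession.builder().master("local[*]").appName("replace-related-demo").getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val words = Seq("banana").toDF("word")

val related = words
  // translate maps characters one-to-one: a -> x, b -> y, c -> z.
  .withColumn("translated", translate(col("word"), "abc", "xyz"))
  // overlay writes "XX" over the characters starting at position 1 (1-based).
  .withColumn("overlaid", overlay(col("word"), lit("XX"), lit(1)))

related.show(false)
```

For "banana", translated is "yxnxnx" and overlaid is "XXnana".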