Spark Scala Regexp Replace Examples

The regexp_replace function is one of the most powerful string manipulation functions in Spark. Let's look at a simple example:

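// Imports used by the examples on this page
import org.apache.spark.sql.functions.{col, regexp_extract, regexp_replace}
// Needed for .toDF; assumes an active SparkSession named `spark`
import spark.implicits._

val df = Seq(
  "Hello World!"
).toDF("example")

val df2 = df
  .withColumn("regexp_replace", regexp_replace(col("example"), "World", "Galaxy"))

df2.show()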

// +------------+--------------+
// |     example|regexp_replace|
// +------------+--------------+
// |Hello World!| Hello Galaxy!|
// +------------+--------------+

Regular expression matching and replacement are commonly used tools in data ETL pipelines for transforming and cleaning string data and for extracting more structured information from it.

Regular expressions, often called regex or regexp, can sometimes get confusing!
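One common source of that confusion is that the pattern argument is always treated as a regular expression, never as a literal string. The sketch below illustrates this with a made-up version string; the column names and the local SparkSession setup are assumptions for illustration, not part of the examples above.

```scala
import java.util.regex.Pattern
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

// Assumption: a local session is fine here; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("regexp-literal-demo").getOrCreate()
import spark.implicits._

val versions = Seq("v1.2.3").toDF("version")

val cleaned = versions
  // Unescaped: "." matches any character, so every character is replaced
  .withColumn("unescaped", regexp_replace(col("version"), ".", "_"))
  // Escaped: "\\." matches only a literal dot
  .withColumn("escaped", regexp_replace(col("version"), "\\.", "_"))
  // Pattern.quote turns arbitrary literal text into a safely escaped pattern
  .withColumn("quoted", regexp_replace(col("version"), Pattern.quote("."), "_"))

cleaned.show()
// unescaped: ______
// escaped:   v1_2_3
// quoted:    v1_2_3
```

Pattern.quote is handy when the text to replace comes from configuration or user input and may contain characters such as ".", "(", or "+".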

Let's examine a more complex example:

val df = Seq(
  ("john@example.com"),
  ("jane.doe@example.org"),
  ("mike.smith@example.net")
).toDF("email")

val df2 = df
  .withColumn("username", regexp_replace(col("email"), "^(.+)@.+$", "$1"))
  .withColumn("domain", regexp_replace(col("email"), "^.+@(.+)$", "$1"))

df2.show()
// +--------------------+----------+-----------+
// |               email|  username|     domain|
// +--------------------+----------+-----------+
// |    john@example.com|      john|example.com|
// |jane.doe@example.org|  jane.doe|example.org|
// |mike.smith@exampl...|mike.smith|example.net|
// +--------------------+----------+-----------+

In this example you can start to get a sense of the power that regex pattern matching provides. Here we perform a pattern match with a capture group, the portion of the pattern within the parentheses. Then we use $1, which refers to the first captured value, as the replacement. Because each pattern is anchored with ^ and $ and therefore matches the whole string, replacing the match with $1 leaves only the captured portion. In this case we are using this technique to extract data from within an existing string field.
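Captured values can also be combined with literal text in the replacement. As a rough sketch, the snippet below uses two capture groups to keep the first character of the username and the full domain while hiding the rest. The masked_email column name and the local SparkSession setup are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

// Assumption: a local session is fine here; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("regexp-groups-demo").getOrCreate()
import spark.implicits._

val emails = Seq("john@example.com", "jane.doe@example.org").toDF("email")

val masked = emails.withColumn(
  "masked_email",
  // $1 = first character of the username, $2 = everything after the @
  regexp_replace(col("email"), "^(.).*@(.+)$", "$1***@$2")
)

masked.show()
// john@example.com     -> j***@example.com
// jane.doe@example.org -> j***@example.org
```

Note that the replacement is a plain string literal. Inside an s"..." interpolated string the $ sign has a different meaning in Scala, so plain or triple-quoted strings are the safer choice here.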

If you find yourself trying to extract data from a capture group, you can also use the regexp_extract function, whose third argument is the index of the group to return. Here is the above example rewritten using it instead:

val df = Seq(
  ("john@example.com"),
  ("jane.doe@example.org"),
  ("mike.smith@example.net")
).toDF("email")

val df2 = df
  .withColumn("username", regexp_extract(col("email"), "^(.+)@(.+)$", 1))
  .withColumn("domain", regexp_extract(col("email"), "^(.+)@(.+)$", 2))

df2.show()
// +--------------------+----------+-----------+
// |               email|  username|     domain|
// +--------------------+----------+-----------+
// |    john@example.com|      john|example.com|
// |jane.doe@example.org|  jane.doe|example.org|
// |mike.smith@exampl...|mike.smith|example.net|
// +--------------------+----------+-----------+
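The two approaches differ when a value does not match the pattern: regexp_replace leaves the input unchanged, while regexp_extract returns an empty string. The sketch below compares them on one valid and one invalid value; the column names and the local SparkSession setup are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_extract, regexp_replace}

// Assumption: a local session is fine here; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("regexp-nomatch-demo").getOrCreate()
import spark.implicits._

val values = Seq("john@example.com", "not-an-email").toDF("value")

val compared = values
  // No match: the original string passes through untouched
  .withColumn("via_replace", regexp_replace(col("value"), "^(.+)@.+$", "$1"))
  // No match: the result is an empty string
  .withColumn("via_extract", regexp_extract(col("value"), "^(.+)@.+$", 1))

compared.show()
// john@example.com -> via_replace: john          via_extract: john
// not-an-email     -> via_replace: not-an-email  via_extract: (empty string)
```

For extraction, the empty string from regexp_extract is usually easier to detect downstream than an unchanged input that looks like a valid result.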

When working with sensitive data you will encounter times when you need to redact it. The regexp_replace function to the rescue! Let's take a look at a situation where we have sensitive credit card information we want to redact:

val df = Seq(
  ("John Doe", "1234-5678-9012-3456"),
  ("Jane Smith", "6011-1234-5678-9012"),
  ("Michael Johnson", "5424-1234-5678-9012"),
  ("Emily Williams", "4111-1234-5678-9012")
).toDF("name", "credit_card")

// Backslashes are doubled because this is a regular Scala string literal
val df2 = df
  .withColumn("redacted_card", regexp_replace(col("credit_card"), "\\d{4}-\\d{4}-\\d{4}-", "****-****-****-"))

df2.show()

// +---------------+-------------------+-------------------+
// |           name|        credit_card|      redacted_card|
// +---------------+-------------------+-------------------+
// |       John Doe|1234-5678-9012-3456|****-****-****-3456|
// |     Jane Smith|6011-1234-5678-9012|****-****-****-9012|
// |Michael Johnson|5424-1234-5678-9012|****-****-****-9012|
// | Emily Williams|4111-1234-5678-9012|****-****-****-9012|
// +---------------+-------------------+-------------------+
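The pattern above assumes every card number is written as four dash-separated groups. If the separators vary, one possible alternative is a lookahead that masks any digit followed by at least four more digits. This is only a sketch: the sample values and the local SparkSession setup are assumptions, and the pattern would also mask other long digit runs, so it is not a complete redaction strategy.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, regexp_replace}

// Assumption: a local session is fine here; in spark-shell this returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("regexp-redact-demo").getOrCreate()
import spark.implicits._

val cards = Seq(
  "1234-5678-9012-3456",
  "1234 5678 9012 3456",
  "1234567890123456"
).toDF("credit_card")

val redacted = cards.withColumn(
  "redacted_card",
  // Triple quotes avoid doubling the backslashes.
  // \d(?=(?:\D*\d){4}) matches a digit only if at least four more digits follow it.
  regexp_replace(col("credit_card"), """\d(?=(?:\D*\d){4})""", "*")
)

redacted.show(false)
// 1234-5678-9012-3456 -> ****-****-****-3456
// 1234 5678 9012 3456 -> **** **** **** 3456
// 1234567890123456    -> ************3456
```

Because the lookahead does not consume the separators, dashes and spaces are left in place and only the digits are replaced.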

If you'd like to see more regex replace examples, drop us a line and let us know.