Spark Scala Mask
mask replaces characters in a string with masking characters — uppercase letters become X, lowercase letters become x, and digits become n by default. It's a quick way to redact sensitive data like emails, phone numbers, and identifiers without destroying the structure of the value.
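mask is a Spark SQL function. In Spark 3.4 it isn't exposed in the org.apache.spark.sql.functions object, so you call it through expr(). Spark 3.5 added mask to the functions object as well, but the expr() form used throughout this page works on both versions: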
def mask(input): Column — via expr()
def mask(input, upperChar): Column — via expr()
def mask(input, upperChar, lowerChar): Column — via expr()
def mask(input, upperChar, lowerChar, digitChar): Column — via expr()
def mask(input, upperChar, lowerChar, digitChar, otherChar): Column — via expr()
The default replacements are X for uppercase, x for lowercase, n for digits, and NULL (keep original) for all other characters. Pass NULL for any parameter to retain the original character in that category.
The mask function first appeared in version 3.4.0.
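To make the replacement rule concrete, here is a minimal plain-Scala model of the per-character behaviour described above. It is only a sketch: maskModel is a hypothetical helper, not part of Spark, it uses None where the SQL function uses NULL, and it does not try to reproduce how Spark handles null input or characters outside the Basic Multilingual Plane.

```scala
// Plain-Scala model of the documented per-character rule; not Spark's own code.
// None plays the role of SQL NULL: keep the original character in that category.
def maskModel(
    input: String,
    upperChar: Option[Char] = Some('X'),
    lowerChar: Option[Char] = Some('x'),
    digitChar: Option[Char] = Some('n'),
    otherChar: Option[Char] = None
): String =
  input.map { c =>
    val replacement =
      if (c.isUpper) upperChar
      else if (c.isLower) lowerChar
      else if (c.isDigit) digitChar
      else otherChar
    replacement.getOrElse(c)
  }

println(maskModel("Alice@Example.com 42"))          // Xxxxx@Xxxxxxx.xxx nn
println(maskModel("AB-12", digitChar = None))       // XX-12
println(maskModel("AB-12", otherChar = Some('*')))  // XX*nn
```

Each character falls into exactly one category, so the order of the checks only matters for readability.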
Here's a basic example masking email addresses and phone numbers:
// Assumes a SparkSession named spark, as provided by spark-shell
import org.apache.spark.sql.functions.expr
import spark.implicits._
val df = Seq(
  ("Alice", "alice@example.com", "555-867-5309"),
  ("Bob", "bob@work.org", "555-123-4567"),
  ("Carol", "carol@shop.net", "555-999-0000"),
  ("Dave", "dave@example.com", "555-222-3333"),
).toDF("name", "email", "phone")
val df2 = df
  .withColumn("masked_email", expr("mask(email)"))
  .withColumn("masked_phone", expr("mask(phone)"))
df2.show(false)
// +-----+-----------------+------------+-----------------+------------+
// |name |email            |phone       |masked_email     |masked_phone|
// +-----+-----------------+------------+-----------------+------------+
// |Alice|alice@example.com|555-867-5309|xxxxx@xxxxxxx.xxx|nnn-nnn-nnnn|
// |Bob  |bob@work.org     |555-123-4567|xxx@xxxx.xxx     |nnn-nnn-nnnn|
// |Carol|carol@shop.net   |555-999-0000|xxxxx@xxxx.xxx   |nnn-nnn-nnnn|
// |Dave |dave@example.com |555-222-3333|xxxx@xxxxxxx.xxx |nnn-nnn-nnnn|
// +-----+-----------------+------------+-----------------+------------+
Notice that the @, ., and - characters pass through unchanged — by default, mask only replaces letters and digits, preserving the structure of the original value.
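Because mask is a SQL function, the same call should also work anywhere Spark accepts a SQL expression string. The sketch below assumes a local SparkSession and uses illustrative names (contacts, masked_email) that are not from the examples above.

```scala
import org.apache.spark.sql.SparkSession

// Local session for a standalone run; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("mask-sql").getOrCreate()
import spark.implicits._

val contacts = Seq(("Alice", "alice@example.com")).toDF("name", "email")

// selectExpr takes SQL expression strings, so no expr() wrapper is needed.
val viaSelectExpr = contacts.selectExpr("name", "mask(email) AS masked_email")

// The same call inside a SQL query against a temporary view (view name is arbitrary).
contacts.createOrReplaceTempView("contacts")
val viaSql = spark.sql("SELECT name, mask(email) AS masked_email FROM contacts")

viaSelectExpr.show(false)
viaSql.show(false)
```

Both DataFrames contain the same masked_email value, xxxxx@xxxxxxx.xxx, so the choice between them is mostly a matter of which style the rest of the job uses.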
Custom masking characters
Pass additional arguments to control what each character category is replaced with. The order is: uppercase replacement, lowercase replacement, digit replacement, and optionally a replacement for all other characters.
val df = Seq(
  ("Alice", "SSN-123-45-6789"),
  ("Bob", "SSN-987-65-4321"),
  ("Carol", "SSN-555-12-3456"),
).toDF("name", "identifier")
val df2 = df
  .withColumn("masked", expr("mask(identifier, 'A', 'a', '#')"))
df2.show(false)
// +-----+---------------+---------------+
// |name |identifier     |masked         |
// +-----+---------------+---------------+
// |Alice|SSN-123-45-6789|AAA-###-##-####|
// |Bob  |SSN-987-65-4321|AAA-###-##-####|
// |Carol|SSN-555-12-3456|AAA-###-##-####|
// +-----+---------------+---------------+
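The signature list also includes a fifth parameter, otherChar, which the example above leaves at its default. As a small sketch, assuming a local SparkSession and an illustrative '*' replacement, passing it masks the punctuation too:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Local session for a standalone run; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("mask-other").getOrCreate()
import spark.implicits._

val ids = Seq("SSN-123-45-6789").toDF("identifier")

// The fifth argument replaces every character that is not a letter or a digit.
val maskedAll = ids.withColumn("masked", expr("mask(identifier, 'A', 'a', '#', '*')"))

maskedAll.show(false)
// masked column: AAA*###*##*####
```

Masking the separators hides the grouping of the value as well as its characters, which may or may not be what you want when the format itself is not sensitive.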
Masking only specific character types
Pass NULL for any parameter to keep the original characters in that category. This is useful when you only want to mask part of the data — for example, hiding digits in a card number while keeping the letters and punctuation intact:
val df = Seq(
  ("Alice", "4111-1111-1111-1111"),
  ("Bob", "5500-0000-0000-0004"),
  ("Carol", "3400-000000-00009"),
).toDF("name", "card_number")
val df2 = df
  .withColumn("masked", expr("mask(card_number, NULL, NULL, '#')"))
df2.show(false)
// +-----+-------------------+-------------------+
// |name |card_number        |masked             |
// +-----+-------------------+-------------------+
// |Alice|4111-1111-1111-1111|####-####-####-####|
// |Bob  |5500-0000-0000-0004|####-####-####-####|
// |Carol|3400-000000-00009  |####-######-#####  |
// +-----+-------------------+-------------------+
Since the card numbers are all digits and dashes, passing NULL for the uppercase and lowercase parameters keeps those categories untouched (there are none here), while # replaces every digit.
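A common variation is to leave the last four characters readable. mask has no option for that, so one possible approach is to mask only a prefix and concatenate the untouched tail. The sketch below combines mask with the SQL functions substring, length, right, and concat; it assumes a local SparkSession and values of at least four characters.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Local session for a standalone run; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("mask-last4").getOrCreate()
import spark.implicits._

val cards = Seq("4111-1111-1111-1111", "3400-000000-00009").toDF("card_number")

// Mask everything except the last four characters, then append those four as-is.
val keepLast4 = cards.withColumn(
  "masked",
  expr(
    """concat(
      |  mask(substring(card_number, 1, length(card_number) - 4), NULL, NULL, '#'),
      |  right(card_number, 4)
      |)""".stripMargin
  )
)

keepLast4.show(false)
// masked column: ####-####-####-1111 and ####-######-#0009
```

The split is by character position, not by digit count, so a dash inside the last four characters would be kept along with the digits around it.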
Null and empty string handling
When the input is null, mask returns null. An empty string passes through unchanged.
val df = Seq(
  ("Alice", "alice@example.com"),
  ("Bob", null),
  ("Carol", "carol@shop.net"),
  ("Dave", ""),
).toDF("name", "email")
val df2 = df
  .withColumn("masked", expr("mask(email)"))
df2.show(false)
// +-----+-----------------+-----------------+
// |name |email            |masked           |
// +-----+-----------------+-----------------+
// |Alice|alice@example.com|xxxxx@xxxxxxx.xxx|
// |Bob  |null             |null             |
// |Carol|carol@shop.net   |xxxxx@xxxx.xxx   |
// |Dave |                 |                 |
// +-----+-----------------+-----------------+
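If downstream consumers should not see nulls, one option is to wrap the call in coalesce and supply a placeholder. This is a sketch under assumptions: the '<missing>' placeholder is purely illustrative, and the session setup is only needed outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Local session for a standalone run; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("mask-nulls").getOrCreate()
import spark.implicits._

val emails = Seq(("Alice", "alice@example.com"), ("Bob", null)).toDF("name", "email")

// mask returns null for null input, so coalesce falls back to the placeholder.
val withPlaceholder = emails.withColumn("masked", expr("coalesce(mask(email), '<missing>')"))

withPlaceholder.show(false)
// masked column: xxxxx@xxxxxxx.xxx and <missing>
```

The placeholder is applied after masking, so it is not itself masked.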
Related functions
For replacing specific substrings with other values, see replace. For pattern-based substitution using regular expressions, see regexp_replace. For character-by-character substitution where you map individual characters to replacements, see translate.
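To see how these relate in practice, the sketch below hides the digits of a phone number three ways: with mask, with regexp_replace, and with translate. It assumes a local SparkSession, and the column names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, regexp_replace, translate}

// Local session for a standalone run; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("mask-related").getOrCreate()
import spark.implicits._

val phones = Seq("555-867-5309").toDF("phone")

val compared = phones
  // mask: replace by character category (digits only here).
  .withColumn("via_mask", expr("mask(phone, NULL, NULL, '#')"))
  // regexp_replace: replace whatever the pattern matches.
  .withColumn("via_regexp", regexp_replace(col("phone"), "[0-9]", "#"))
  // translate: map each listed character to the one at the same position.
  .withColumn("via_translate", translate(col("phone"), "0123456789", "##########"))

compared.show(false)
// all three new columns: ###-###-####
```

For plain category-based redaction, mask is the shortest of the three; the other two are more useful when the rule depends on position, context, or a specific set of characters.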