Spark Scala Soundex
The soundex function returns the soundex code of a string column — a four-character phonetic encoding that groups similar-sounding names together. It's useful for fuzzy matching, deduplication, and search where exact spelling varies.
def soundex(e: Column): Column
Soundex keeps the first letter of the string and encodes the remaining consonants as digits according to a fixed mapping. Vowels, H, W, and Y contribute no digit, and adjacent letters that share a digit are encoded only once. For a name that starts with a letter, the result is that letter followed by three digits, padded with zeros when the name has too few consonants (e.g., R163 or S530). Names that sound alike, such as "Robert" and "Rupert", produce the same code, making it easy to find phonetic matches across a dataset.
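The encoding rules can be sketched in plain Scala. This is a minimal illustration of the classic American Soundex rules, not Spark's own implementation. The names SoundexSketch and encode are hypothetical, and the choice to drop non-letter characters is an assumption made for this example.

```scala
object SoundexSketch {
  // Digit for each letter A..Z under the classic American Soundex mapping.
  // '0' marks vowels and Y: no digit is emitted, but they separate repeated codes.
  // '7' marks H and W: skipped entirely, so they do not separate repeated codes.
  private val Codes = "01230127022455012623017202"

  private def codeOf(c: Char): Char = Codes(c - 'A')

  def encode(name: String): String = {
    // Assumption for this sketch: ignore anything that is not an ASCII letter.
    val letters = name.toUpperCase.filter(c => c >= 'A' && c <= 'Z')
    if (letters.isEmpty) return ""

    val out = new StringBuilder
    out += letters.head               // the first letter is kept as-is
    var last = codeOf(letters.head)   // used to collapse adjacent duplicates

    for (c <- letters.tail) {
      if (out.length < 4) {
        val code = codeOf(c)
        if (code != '7') {
          if (code != '0' && code != last) out += code
          last = code
        }
      }
    }
    // Pad with zeros so the code is always four characters long.
    out.toString.padTo(4, '0')
  }
}
```

With this sketch, SoundexSketch.encode("Robert") and SoundexSketch.encode("Rupert") both give R163, while "Catherine" gives C365 because the first letter is kept unchanged.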
// Assumes an active SparkSession named `spark`, as in spark-shell.
import org.apache.spark.sql.functions.{col, soundex}
import spark.implicits._

val df = Seq(
  "Robert",
  "Rupert",
  "Smith",
  "Smythe",
  "Catherine",
  "Katherine",
).toDF("name")

val df2 = df
  .withColumn("soundex_code", soundex(col("name")))

df2.show(false)
// +---------+------------+
// |name     |soundex_code|
// +---------+------------+
// |Robert   |R163        |
// |Rupert   |R163        |
// |Smith    |S530        |
// |Smythe   |S530        |
// |Catherine|C365        |
// |Katherine|K365        |
// +---------+------------+
Notice that "Robert" and "Rupert" share the code R163, and "Smith" and "Smythe" share S530. However, "Catherine" and "Katherine" produce different codes (C365 vs K365) because soundex preserves the first letter — a limitation to be aware of when matching names that differ only in their initial character.
Comparing Two Columns
A common use case is checking whether two name columns sound alike. Compare the soundex codes directly:
val df = Seq(
  ("Robert", "Rupert"),
  ("Smith", "Smythe"),
  ("Catherine", "Katherine"),
  ("John", "Jane"),
  ("Alice", "Bob"),
).toDF("name1", "name2")

val df2 = df
  .withColumn("soundex1", soundex(col("name1")))
  .withColumn("soundex2", soundex(col("name2")))
  .withColumn("sounds_alike", soundex(col("name1")) === soundex(col("name2")))

df2.show(false)
// +---------+---------+--------+--------+------------+
// |name1    |name2    |soundex1|soundex2|sounds_alike|
// +---------+---------+--------+--------+------------+
// |Robert   |Rupert   |R163    |R163    |true        |
// |Smith    |Smythe   |S530    |S530    |true        |
// |Catherine|Katherine|C365    |K365    |false       |
// |John     |Jane     |J500    |J500    |true        |
// |Alice    |Bob      |A420    |B100    |false       |
// +---------+---------+--------+--------+------------+
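This pattern works well for self-joins or deduplication: group or join on the soundex code to find records that may refer to the same person despite spelling differences. Treat the matches as candidates rather than confirmed duplicates, since unrelated names such as "John" and "Jane" also share a code (J500).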
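As a rough sketch of the grouping approach, the snippet below collects the names that share a soundex code and keeps only the codes with more than one name. The sample data, the column name names, and the local SparkSession setup are assumptions for illustration; in spark-shell the session already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, size, soundex}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("soundex-grouping")
  .getOrCreate()
import spark.implicits._

// Hypothetical sample data.
val people = Seq("Robert", "Rupert", "Smith", "Smythe", "Alice").toDF("name")

// Group names by soundex code and keep codes shared by more than one name.
val candidates = people
  .groupBy(soundex(col("name")).as("soundex_code"))
  .agg(collect_list(col("name")).as("names"))
  .filter(size(col("names")) > 1)

candidates.show(false)
```

For this input, two groups remain: R163 with Robert and Rupert, and S530 with Smith and Smythe. Row order and the order of names inside each list are not guaranteed.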
Handling Nulls and Empty Strings
When soundex encounters a null, the result is null. An empty string returns an empty string:
val df = Seq(
  "Alice",
  null,
  "",
  "Bob",
).toDF("name")

val df2 = df
  .withColumn("soundex_code", soundex(col("name")))

df2.show(false)
// +-----+------------+
// |name |soundex_code|
// +-----+------------+
// |Alice|A420        |
// |null |null        |
// |     |            |
// |Bob  |B100        |
// +-----+------------+
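One consequence for matching: because an empty string yields an empty code, two blank names compare as equal. The sketch below shows one possible guard that turns blank input into null before comparing, so blank or missing names never count as a match. The helper soundexOrNull and the sample data are hypothetical and not part of Spark.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, length, soundex, trim, when}

// Local session for illustration only; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("soundex-blank-guard")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: returns null instead of an empty code for blank or null input.
// `when` without `otherwise` yields null when the condition is false or null.
def soundexOrNull(c: Column): Column =
  when(length(trim(c)) > 0, soundex(c))

// Hypothetical sample data, including a blank pair and a null name.
val pairs = Seq(
  ("Smith", "Smythe"),
  ("", ""),
  (null, "Bob"),
).toDF("name1", "name2")

val checked = pairs
  .withColumn("sounds_alike", soundexOrNull(col("name1")) === soundexOrNull(col("name2")))

checked.show(false)
```

Here only the Smith/Smythe row is true. The blank pair and the row with a null name evaluate to null, so a filter on sounds_alike drops them.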
Related Functions
For case normalization before matching, see lower and upper. For general string comparison patterns, like and ilike offer pattern-based matching with wildcards.