Spark Scala Levenshtein Distance

The levenshtein function computes the Levenshtein distance between two string columns — the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into the other. It's useful for fuzzy matching, typo detection, and deduplication.

def levenshtein(l: Column, r: Column): Column

The function returns an integer representing the edit distance. Identical strings return 0. The larger the number, the more different the strings are.

val df = Seq(
  ("kitten", "sitting"),
  ("spark", "park"),
  ("scala", "scala"),
  ("hello", "hallo"),
  ("data", "date"),
).toDF("string1", "string2")

val df2 = df
  .withColumn("distance", levenshtein(col("string1"), col("string2")))

df2.show(false)
// +-------+-------+--------+
// |string1|string2|distance|
// +-------+-------+--------+
// |kitten |sitting|3       |
// |spark  |park   |1       |
// |scala  |scala  |0       |
// |hello  |hallo  |1       |
// |data   |date   |1       |
// +-------+-------+--------+

"kitten" → "sitting" requires three edits: substitute k → s, substitute e → i, and insert g. "spark" → "park" needs just one deletion.

Fuzzy Name Matching

A common use case is flagging records where two name columns are close but not identical. Compare the distance against a threshold to decide what counts as a "close match":

val df = Seq(
  ("Smith", "Smyth"),
  ("Johnson", "Jonson"),
  ("Williams", "Williamson"),
  ("Garcia", "Garsia"),
  ("Brown", "Brawn"),
).toDF("name1", "name2")

val df2 = df
  .withColumn("distance", levenshtein(col("name1"), col("name2")))
  .withColumn("close_match", levenshtein(col("name1"), col("name2")) <= 2)

df2.show(false)
// +--------+----------+--------+-----------+
// |name1   |name2     |distance|close_match|
// +--------+----------+--------+-----------+
// |Smith   |Smyth     |1       |true       |
// |Johnson |Jonson    |1       |true       |
// |Williams|Williamson|2       |true       |
// |Garcia  |Garsia    |1       |true       |
// |Brown   |Brawn     |1       |true       |
// +--------+----------+--------+-----------+

This pattern works well for deduplication pipelines where you need to find records that likely refer to the same entity despite spelling variations. Combine it with lower to normalize case before comparing.

Handling Nulls and Empty Strings

When either input is null, the result is null. An empty string is treated as a valid string — the distance from an empty string to another string equals the length of the other string:

val df = Seq(
  ("Alice", "Alice"),
  ("Bob", null),
  (null, "Carol"),
  (null, null),
  ("", "test"),
).toDF("string1", "string2")

val df2 = df
  .withColumn("distance", levenshtein(col("string1"), col("string2")))

df2.show(false)
// +-------+-------+--------+
// |string1|string2|distance|
// +-------+-------+--------+
// |Alice  |Alice  |0       |
// |Bob    |null   |null    |
// |null   |Carol  |null    |
// |null   |null   |null    |
// |       |test   |4       |
// +-------+-------+--------+

The empty string to "test" distance is 4 because four insertions are needed.

For phonetic matching (grouping names that sound alike regardless of spelling), see soundex. For pattern-based matching with wildcards, see like and ilike.

Spark Scala Levenshtein Distance

Fuzzy Name Matching

Handling Nulls and Empty Strings

Related Functions