The Reference You Need
Spark Scala Examples
Simple Spark Scala examples to help you quickly complete your data ETL pipelines. Save time digging through the Spark Scala function API and get straight to the code you need.
-
url_encode, url_decode, and parse_url in Spark Scala: Work With URLs in a DataFrame
url_encode converts a string into application/x-www-form-urlencoded format so it can be safely used in a URL. url_decode reverses that transformation. parse_url extracts pieces of a URL — the host, path, query string, or a specific query parameter. All three are Spark SQL functions, so you call them through expr().
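A minimal sketch of all three, assuming a live SparkSession named `spark` (as in spark-shell) and Spark 3.4+ for url_encode/url_decode:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.expr

val df = Seq("https://example.com/search?q=spark scala").toDF("url")

df.select(
  expr("url_encode(url)").as("encoded"),          // percent-encodes the whole string
  expr("parse_url(url, 'HOST')").as("host"),      // example.com
  expr("parse_url(url, 'QUERY', 'q')").as("q")    // value of one query parameter
).show(false)
```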
-
mask in Spark Scala: Mask Sensitive Data in a DataFrame Column
mask replaces characters in a string with masking characters — uppercase letters become X, lowercase letters become x, and digits become n by default. It's a quick way to redact sensitive data like emails, phone numbers, and identifiers without destroying the structure of the value.
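A quick sketch, assuming a `spark` session in scope and Spark 3.4+ where the mask SQL function is available:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.expr

val df = Seq("Card 4111-1111", "user@Example.com").toDF("value")

// Defaults: A-Z -> X, a-z -> x, 0-9 -> n; other characters are kept as-is.
df.select(expr("mask(value)").as("masked")).show(false)

// Optional arguments override the masking characters, e.g. mask(value, '#', '#', '*')
```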
-
btrim in Spark Scala: Trim Both Sides of a String in a DataFrame Column
btrim strips characters from both ends of a string — whitespace by default, or a set of characters you pass as a second argument. It's the SQL-standard equivalent of trim — useful when you're writing Spark SQL expressions or prefer the more explicit name.
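Both forms in a minimal sketch, assuming a live `spark` session:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.expr

val df = Seq("  padded  ", "xxhelloxx").toDF("s")

df.select(
  expr("btrim(s)").as("whitespace_trimmed"),   // trims spaces by default
  expr("btrim(s, 'x')").as("x_trimmed")        // trims the characters you list
).show(false)
```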
-
decode and encode in Spark Scala: Convert Between Strings and Binary in DataFrames
encode converts a string column to binary using a specified character set. decode does the reverse — it converts binary data back to a string. Together they let you move between string and binary representations, which is useful when working with systems that expect raw bytes or when you need to control the character encoding explicitly.
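A round-trip sketch, assuming a `spark` session in scope:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{encode, decode, col}

val df = Seq("café").toDF("s")

df.withColumn("bytes", encode(col("s"), "UTF-8"))          // string -> binary
  .withColumn("restored", decode(col("bytes"), "UTF-8"))   // binary -> string
  .show(false)
```

The charset argument controls the byte layout — swapping "UTF-8" for "ISO-8859-1" here would produce different bytes for the accented character.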
-
Format String in Spark Scala: Printf-Style Column Formatting for DataFrames
The format_string function formats column values into a string using printf-style patterns. It's useful for building human-readable labels, combining columns into structured text, or formatting numbers inline without changing their type first.
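A minimal sketch, assuming a live `spark` session; the pattern follows java.util.Formatter (printf) conventions:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{format_string, col}

val df = Seq(("widget", 3, 4.5)).toDF("name", "qty", "price")

df.select(
  format_string("%s: %d units at $%.2f", col("name"), col("qty"), col("price")).as("label")
).show(false)
// label: "widget: 3 units at $4.50"
```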
-
find_in_set in Spark Scala: Search Comma-Delimited Strings in a DataFrame
find_in_set returns the 1-based position of a string within a comma-delimited list stored in another column. It returns 0 if the string isn't found (or if the search string itself contains a comma) and null if either input is null.
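A small sketch, assuming a `spark` session in scope:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.expr

val df = Seq(("b", "a,b,c"), ("z", "a,b,c")).toDF("needle", "haystack")

df.select(
  expr("find_in_set(needle, haystack)").as("pos")   // 2 for "b", 0 for "z"
).show()
```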
-
Sentences: Tokenize Text into Words and Sentences in Spark Scala DataFrames
sentences splits a string into an array of sentences, where each sentence is an array of words. It's useful for text analysis tasks like counting sentences, extracting individual words, or preparing text for downstream NLP processing.
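A minimal sketch, assuming a live `spark` session:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.expr

val df = Seq("Spark is fast. It scales well!").toDF("text")

// Each row becomes an array of sentences; each sentence is an array of words.
df.select(expr("sentences(text)").as("tokens")).show(false)
// tokens: [[Spark, is, fast], [It, scales, well]]
```

Optional locale arguments, e.g. `sentences(text, 'en', 'US')`, control how sentence and word boundaries are detected.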
-
levenshtein in Spark Scala: Measure String Distance in a DataFrame
The levenshtein function computes the Levenshtein distance between two string columns — the minimum number of single-character edits (insertions, deletions, or substitutions) needed to transform one string into the other. It's useful for fuzzy matching, typo detection, and deduplication.
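The classic example as a sketch, assuming a `spark` session in scope:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{levenshtein, col}

val df = Seq(("kitten", "sitting"), ("spark", "spark")).toDF("a", "b")

df.select(col("a"), col("b"), levenshtein(col("a"), col("b")).as("distance")).show()
// kitten -> sitting needs 3 edits; identical strings give 0
```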
-
soundex in Spark Scala: Phonetic Matching in a DataFrame Column
The soundex function returns the soundex code of a string column — a four-character phonetic encoding that groups similar-sounding names together. It's useful for fuzzy matching, deduplication, and search where exact spelling varies.
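A quick sketch, assuming a live `spark` session:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{soundex, col}

val df = Seq("Robert", "Rupert", "Smith", "Smyth").toDF("name")

df.select(col("name"), soundex(col("name")).as("code")).show()
// Robert and Rupert both map to R163; Smith and Smyth both map to S530
```

Matching on the soundex code rather than the raw string is what makes spelling-tolerant joins and deduplication possible.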
-
hex and unhex in Spark Scala: Hexadecimal Conversion in DataFrames
hex converts an integer or string column to its hexadecimal representation. unhex does the reverse — it decodes a hex string back to binary. These are useful when working with low-level data formats, color codes, or any system that uses hex encoding.
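A round-trip sketch, assuming a `spark` session in scope:

```scala
import spark.implicits._
import org.apache.spark.sql.functions.{hex, unhex, decode, col}

val df = Seq(("Spark", 255L)).toDF("s", "n")

df.select(
  hex(col("s")).as("hex_string"),   // "537061726B"
  hex(col("n")).as("hex_number"),   // "FF"
  decode(unhex(hex(col("s"))), "UTF-8").as("round_trip")   // back to "Spark"
).show(false)
```

Note that unhex returns binary, so pairing it with decode (covered above) gets you back to a readable string.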