Spark Scala URL Encode, Decode, and Parse
url_encode converts a string into application/x-www-form-urlencoded format so it can be safely used in a URL. url_decode reverses that transformation. parse_url extracts pieces of a URL — the host, path, query string, or a specific query parameter. All three are Spark SQL functions, so you call them through expr().
Encoding strings for URLs
url_encode(str) - Translates str to application/x-www-form-urlencoded format
url_encode takes a string and returns a percent-encoded version. Spaces become +, reserved characters like :, /, &, and = become %XX escape sequences, and non-ASCII characters are encoded using UTF-8 byte values.
The url_encode function first appeared in Spark 3.4.0. Here it is in action:
import org.apache.spark.sql.functions.expr
import spark.implicits._ // both already in scope in spark-shell

val df = Seq(
"https://spark.apache.org",
"hello world",
"name=Ada Lovelace&city=London",
"café & croissant",
).toDF("input")
val df2 = df
.withColumn("encoded", expr("url_encode(input)"))
df2.show(false)
// +-----------------------------+-----------------------------------+
// |input |encoded |
// +-----------------------------+-----------------------------------+
// |https://spark.apache.org |https%3A%2F%2Fspark.apache.org |
// |hello world |hello+world |
// |name=Ada Lovelace&city=London|name%3DAda+Lovelace%26city%3DLondon|
// |café & croissant |caf%C3%A9+%26+croissant |
// +-----------------------------+-----------------------------------+
Notice that = becomes %3D, & becomes %26, and the é in café is encoded as its two UTF-8 bytes %C3%A9. This is the format you want when building a URL query string from user input or other data.
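Because url_encode escapes = and &, you can safely assemble a query string from raw column values. A minimal sketch, assuming hypothetical name and city columns that aren't part of the example above:

```scala
import org.apache.spark.sql.functions.expr

// Hypothetical input columns for illustration
val people = Seq(
  ("Ada Lovelace", "London"),
  ("Tim & Tom", "New York")
).toDF("name", "city")

// Encode each value before stitching the query string together
val withQuery = people.withColumn(
  "query_string",
  expr("concat('name=', url_encode(name), '&city=', url_encode(city))")
)

withQuery.show(false)
// "Tim & Tom" becomes Tim+%26+Tom, so its '&' can't be mistaken
// for a parameter separator.
```

Without the encoding step, the literal & inside "Tim & Tom" would split the value into two bogus parameters.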
Decoding URL-encoded strings
url_decode(str) - Decodes str from application/x-www-form-urlencoded format
url_decode is the inverse of url_encode. It converts + back to a space and %XX sequences back to their original characters.
The url_decode function first appeared in Spark 3.4.0. Reversing the previous example:
val df = Seq(
"https%3A%2F%2Fspark.apache.org",
"hello+world",
"name%3DAda+Lovelace%26city%3DLondon",
"caf%C3%A9+%26+croissant",
).toDF("input")
val df2 = df
.withColumn("decoded", expr("url_decode(input)"))
df2.show(false)
// +-----------------------------------+-----------------------------+
// |input |decoded |
// +-----------------------------------+-----------------------------+
// |https%3A%2F%2Fspark.apache.org |https://spark.apache.org |
// |hello+world |hello world |
// |name%3DAda+Lovelace%26city%3DLondon|name=Ada Lovelace&city=London|
// |caf%C3%A9+%26+croissant |café & croissant |
// +-----------------------------------+-----------------------------+
Each row in the decoded column matches the original input from the previous example — encoding and then decoding with these two functions is a lossless round trip.
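If you want to verify that round-trip property on your own data, compare each value against its encode-then-decode result. A small sketch, reusing two inputs from the first example:

```scala
import org.apache.spark.sql.functions.expr

val strings = Seq("hello world", "café & croissant").toDF("input")

// A boolean column that is true when the round trip is lossless
val checked = strings.withColumn(
  "roundtrip_ok",
  expr("url_decode(url_encode(input)) = input")
)

checked.show(false)
// roundtrip_ok is true for every row
```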
Extracting parts of a URL
parse_url(url, partToExtract) - Extracts a part from a URL
parse_url(url, partToExtract, key) - Extracts a specific query parameter from a URL
parse_url pulls named components out of a URL string. The partToExtract argument is a string literal that selects which piece you want. Common values are:
PROTOCOL — the scheme (http, https)
HOST — the domain name
PATH — everything from the first / after the host up to the query string
QUERY — everything after the ?, without the ? itself
REF — the fragment after #
AUTHORITY — userinfo@host:port
FILE — path plus query string
USERINFO — the user:pass portion of the authority
val df = Seq(
"https://spark.apache.org/docs/latest/?tab=scala",
"http://example.com:8080/path/to/file.html#section",
"https://user:pass@api.example.com/v2/users?id=42&sort=asc",
).toDF("url")
val df2 = df
.withColumn("protocol", expr("parse_url(url, 'PROTOCOL')"))
.withColumn("host", expr("parse_url(url, 'HOST')"))
.withColumn("path", expr("parse_url(url, 'PATH')"))
.withColumn("query", expr("parse_url(url, 'QUERY')"))
df2.show(false)
// +---------------------------------------------------------+--------+----------------+------------------+--------------+
// |url |protocol|host |path |query |
// +---------------------------------------------------------+--------+----------------+------------------+--------------+
// |https://spark.apache.org/docs/latest/?tab=scala |https |spark.apache.org|/docs/latest/ |tab=scala |
// |http://example.com:8080/path/to/file.html#section |http |example.com |/path/to/file.html|null |
// |https://user:pass@api.example.com/v2/users?id=42&sort=asc|https |api.example.com |/v2/users |id=42&sort=asc|
// +---------------------------------------------------------+--------+----------------+------------------+--------------+
When a URL doesn't contain the requested part — like the second row, which has no query string — parse_url returns null for that column. The HOST extraction automatically strips the port, the userinfo, and everything from the path onward.
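The less common parts work the same way. A quick sketch against a single made-up URL that exercises REF, AUTHORITY, FILE, and USERINFO:

```scala
import org.apache.spark.sql.functions.expr

val one = Seq("http://user:pass@example.com:8080/docs/page.html?x=1#intro")
  .toDF("url")

one
  .withColumn("ref", expr("parse_url(url, 'REF')"))
  .withColumn("authority", expr("parse_url(url, 'AUTHORITY')"))
  .withColumn("file", expr("parse_url(url, 'FILE')"))
  .withColumn("userinfo", expr("parse_url(url, 'USERINFO')"))
  .show(false)
// ref       = intro
// authority = user:pass@example.com:8080
// file      = /docs/page.html?x=1
// userinfo  = user:pass
```

FILE is handy when you want the path and query together but don't care about the scheme or host.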
Extracting a specific query parameter
When you pass a third argument to parse_url with QUERY as the part, Spark looks up a specific query parameter by name and returns just its value:
val df = Seq(
"https://shop.example.com/search?q=laptop&sort=price&page=2",
"https://shop.example.com/search?q=keyboard&sort=rating",
"https://shop.example.com/search?q=monitor&page=1",
).toDF("url")
val df2 = df
.withColumn("search_term", expr("parse_url(url, 'QUERY', 'q')"))
.withColumn("sort_by", expr("parse_url(url, 'QUERY', 'sort')"))
.withColumn("page", expr("parse_url(url, 'QUERY', 'page')"))
df2.show(false)
// +----------------------------------------------------------+-----------+-------+----+
// |url |search_term|sort_by|page|
// +----------------------------------------------------------+-----------+-------+----+
// |https://shop.example.com/search?q=laptop&sort=price&page=2|laptop |price |2 |
// |https://shop.example.com/search?q=keyboard&sort=rating |keyboard |rating |null|
// |https://shop.example.com/search?q=monitor&page=1 |monitor |null |1 |
// +----------------------------------------------------------+-----------+-------+----+
Rows missing a given parameter get null for that column — the second URL has no page parameter, and the third has no sort. This is a lot cleaner than splitting the query string yourself and parsing each key=value pair.
Note that parse_url returns the value as-is — if the value itself is URL-encoded, wrap the call in url_decode to get the original text:
df.withColumn("term", expr("url_decode(parse_url(url, 'QUERY', 'q'))"))
Null handling
All three functions return null when their input is null. No exception is thrown and no empty string is substituted.
val df = Seq(
("Alice", "https://spark.apache.org/docs/"),
("Bob", null),
("Carol", "https://example.com/?ref=home"),
).toDF("name", "url")
val df2 = df
.withColumn("encoded", expr("url_encode(url)"))
.withColumn("host", expr("parse_url(url, 'HOST')"))
df2.show(false)
// +-----+------------------------------+-----------------------------------------+----------------+
// |name |url |encoded |host |
// +-----+------------------------------+-----------------------------------------+----------------+
// |Alice|https://spark.apache.org/docs/|https%3A%2F%2Fspark.apache.org%2Fdocs%2F |spark.apache.org|
// |Bob |null |null |null |
// |Carol|https://example.com/?ref=home |https%3A%2F%2Fexample.com%2F%3Fref%3Dhome|example.com |
// +-----+------------------------------+-----------------------------------------+----------------+
Bob's null URL passes through as null for both url_encode and parse_url, following Spark's standard null propagation rules.
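If downstream code would rather see a placeholder than a null, coalesce provides a fallback. A sketch with an arbitrary "unknown" default:

```scala
import org.apache.spark.sql.functions.{coalesce, expr, lit}

val visits = Seq(
  ("Alice", "https://spark.apache.org/docs/"),
  ("Bob", null)
).toDF("name", "url")

// Substitute a default when parse_url propagates a null
val withHost = visits.withColumn(
  "host",
  coalesce(expr("parse_url(url, 'HOST')"), lit("unknown")) // "unknown" is a made-up default
)

withHost.show(false)
// Bob's host shows as unknown instead of null
```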
Related functions
For converting between strings and binary representations with a specific character set, see decode and encode. For hexadecimal string conversions, see hex and unhex. For Base64 encoding, see base64 and unbase64. For extracting substrings with regular expressions — useful when parse_url doesn't cover your exact case — see regexp_extract.