Spark Scala URL Encode, Decode, and Parse
url_encode converts a string into application/x-www-form-urlencoded format so it can be safely used in a URL. url_decode reverses that transformation. parse_url extracts pieces of a URL — the host, path, query string, or a specific query parameter. All three are Spark SQL functions, so you call them through expr().
Encoding strings for URLs
url_encode(str) - Translates str to application/x-www-form-urlencoded format
url_encode takes a string and returns a percent-encoded version. Spaces become +, reserved characters like :, /, &, and = become %XX escape sequences, and non-ASCII characters are encoded using UTF-8 byte values.
The url_encode function first appeared in Spark 3.4.0. Here it is in action:
import org.apache.spark.sql.functions.expr
import spark.implicits._ // both already in scope in spark-shell

val df = Seq(
"https://spark.apache.org",
"hello world",
"name=Ada Lovelace&city=London",
"café & croissant",
).toDF("input")
val df2 = df
.withColumn("encoded", expr("url_encode(input)"))
df2.show(false)
// +-----------------------------+-----------------------------------+
// |input |encoded |
// +-----------------------------+-----------------------------------+
// |https://spark.apache.org |https%3A%2F%2Fspark.apache.org |
// |hello world |hello+world |
// |name=Ada Lovelace&city=London|name%3DAda+Lovelace%26city%3DLondon|
// |café & croissant |caf%C3%A9+%26+croissant |
// +-----------------------------+-----------------------------------+
Notice that = becomes %3D, & becomes %26, and the é in café is encoded as its two UTF-8 bytes %C3%A9. This is the format you want when building a URL query string from user input or other data.
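Because url_encode escapes = and &, you can safely assemble a query string from raw column values. A minimal sketch, assuming hypothetical name and city columns that aren't part of the example above:

```scala
import org.apache.spark.sql.functions.expr

// Hypothetical input columns for illustration
val people = Seq(
  ("Ada Lovelace", "London"),
  ("Tim & Tom", "New York")
).toDF("name", "city")

// Encode each value before stitching the query string together
val withQuery = people.withColumn(
  "query_string",
  expr("concat('name=', url_encode(name), '&city=', url_encode(city))")
)

withQuery.show(false)
// "Tim & Tom" becomes Tim+%26+Tom, so its '&' can't be mistaken
// for a parameter separator.
```

Without the encoding step, the literal & inside "Tim & Tom" would split the value into two bogus parameters.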
Decoding URL-encoded strings
url_decode(str) - Decodes str from application/x-www-form-urlencoded format
url_decode is the inverse of url_encode. It converts + back to a space and %XX sequences back to their original characters.
The url_decode function first appeared in Spark 3.4.0. Reversing the previous example:
val df = Seq(
"https%3A%2F%2Fspark.apache.org",
"hello+world",
"name%3DAda+Lovelace%26city%3DLondon",
"caf%C3%A9+%26+croissant",
).toDF("input")
val df2 = df
.withColumn("decoded", expr("url_decode(input)"))
df2.show(false)
// +-----------------------------------+-----------------------------+
// |input |decoded |
// +-----------------------------------+-----------------------------+
// |https%3A%2F%2Fspark.apache.org |https://spark.apache.org |
// |hello+world |hello world |
// |name%3DAda+Lovelace%26city%3DLondon|name=Ada Lovelace&city=London|
// |caf%C3%A9+%26+croissant |café & croissant |
// +-----------------------------------+-----------------------------+
Each row in the decoded column matches the original input from the previous example — encoding and then decoding with these two functions is a lossless round trip.
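If you want to verify that round-trip property on your own data, compare each value against its encode-then-decode result. A small sketch, reusing two inputs from the first example:

```scala
import org.apache.spark.sql.functions.expr

val strings = Seq("hello world", "café & croissant").toDF("input")

// A boolean column that is true when the round trip is lossless
val checked = strings.withColumn(
  "roundtrip_ok",
  expr("url_decode(url_encode(input)) = input")
)

checked.show(false)
// roundtrip_ok is true for every row
```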
Extracting parts of a URL
parse_url(url, partToExtract) - Extracts a part from a URL
parse_url(url, partToExtract, key) - Extracts a specific query parameter from a URL
parse_url pulls named components out of a URL string. The partToExtract argument is a string literal that selects which piece you want. Common values are:
PROTOCOL — the scheme (http, https)
HOST — the domain name
PATH — everything from the first / after the host up to the query string
QUERY — everything after the ?, without the ? itself
REF — the fragment after #
AUTHORITY — userinfo@host:port
FILE — path plus query string
USERINFO — the user:pass portion of the authority
val df = Seq(
"https://spark.apache.org/docs/latest/?tab=scala",
"http://example.com:8080/path/to/file.html#section",
"https://user:pass@api.example.com/v2/users?id=42&sort=asc",
).toDF("url")
val df2 = df
.withColumn("protocol", expr("parse_url(url, 'PROTOCOL')"))
.withColumn("host", expr("parse_url(url, 'HOST')"))
.withColumn("path", expr("parse_url(url, 'PATH')"))
.withColumn("query", expr("parse_url(url, 'QUERY')"))
df2.show(false)
// +---------------------------------------------------------+--------+----------------+------------------+--------------+
// |url |protocol|host |path |query |
// +---------------------------------------------------------+--------+----------------+------------------+--------------+
// |https://spark.apache.org/docs/latest/?tab=scala |https |spark.apache.org|/docs/latest/ |tab=scala |
// |http://example.com:8080/path/to/file.html#section |http |example.com |/path/to/file.html|null |
// |https://user:pass@api.example.com/v2/users?id=42&sort=asc|https |api.example.com |/v2/users |id=42&sort=asc|
// +---------------------------------------------------------+--------+----------------+------------------+--------------+
When a URL doesn't contain the requested part — like the second row, which has no query string — parse_url returns null for that column. The HOST extraction automatically strips the port, the userinfo, and everything from the path onward.
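The less common parts work the same way. A quick sketch against a single made-up URL that exercises REF, AUTHORITY, FILE, and USERINFO:

```scala
import org.apache.spark.sql.functions.expr

val one = Seq("http://user:pass@example.com:8080/docs/page.html?x=1#intro")
  .toDF("url")

one
  .withColumn("ref", expr("parse_url(url, 'REF')"))
  .withColumn("authority", expr("parse_url(url, 'AUTHORITY')"))
  .withColumn("file", expr("parse_url(url, 'FILE')"))
  .withColumn("userinfo", expr("parse_url(url, 'USERINFO')"))
  .show(false)
// ref       = intro
// authority = user:pass@example.com:8080
// file      = /docs/page.html?x=1
// userinfo  = user:pass
```

FILE is handy when you want the path and query together but don't care about the scheme or host.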
Extracting a specific query parameter
When you pass a third argument to parse_url with QUERY as the part, Spark looks up a specific query parameter by name and returns just its value:
val df = Seq(
"https://shop.example.com/search?q=laptop&sort=price&page=2",
"https://shop.example.com/search?q=keyboard&sort=rating",
"https://shop.example.com/search?q=monitor&page=1",
).toDF("url")
val df2 = df
.withColumn("search_term", expr("parse_url(url, 'QUERY', 'q')"))
.withColumn("sort_by", expr("parse_url(url, 'QUERY', 'sort')"))
.withColumn("page", expr("parse_url(url, 'QUERY', 'page')"))
df2.show(false)
// +----------------------------------------------------------+-----------+-------+----+
// |url |search_term|sort_by|page|
// +----------------------------------------------------------+-----------+-------+----+
// |https://shop.example.com/search?q=laptop&sort=price&page=2|laptop |price |2 |
// |https://shop.example.com/search?q=keyboard&sort=rating |keyboard |rating |null|
// |https://shop.example.com/search?q=monitor&page=1 |monitor |null |1 |
// +----------------------------------------------------------+-----------+-------+----+
Rows missing a given parameter get null for that column — the second URL has no page parameter, and the third has no sort. This is a lot cleaner than splitting the query string yourself and parsing each key=value pair.
Note that parse_url returns the value as-is — if the value itself is URL-encoded, wrap the call in url_decode to get the original text:
df.withColumn("term", expr("url_decode(parse_url(url, 'QUERY', 'q'))"))
Null handling
All three functions return null when their input is null. No exception is thrown and no empty string is substituted.
val df = Seq(
("Alice", "https://spark.apache.org/docs/"),
("Bob", null),
("Carol", "https://example.com/?ref=home"),
).toDF("name", "url")
val df2 = df
.withColumn("encoded", expr("url_encode(url)"))
.withColumn("host", expr("parse_url(url, 'HOST')"))
df2.show(false)
// +-----+------------------------------+-----------------------------------------+----------------+
// |name |url |encoded |host |
// +-----+------------------------------+-----------------------------------------+----------------+
// |Alice|https://spark.apache.org/docs/|https%3A%2F%2Fspark.apache.org%2Fdocs%2F |spark.apache.org|
// |Bob |null |null |null |
// |Carol|https://example.com/?ref=home |https%3A%2F%2Fexample.com%2F%3Fref%3Dhome|example.com |
// +-----+------------------------------+-----------------------------------------+----------------+
Bob's null URL passes through as null for both url_encode and parse_url, following Spark's standard null propagation rules.
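If downstream code would rather see a placeholder than a null, coalesce provides a fallback. A sketch with an arbitrary "unknown" default:

```scala
import org.apache.spark.sql.functions.{coalesce, expr, lit}

val visits = Seq(
  ("Alice", "https://spark.apache.org/docs/"),
  ("Bob", null)
).toDF("name", "url")

// Substitute a default when parse_url propagates a null
val withHost = visits.withColumn(
  "host",
  coalesce(expr("parse_url(url, 'HOST')"), lit("unknown")) // "unknown" is a made-up default
)

withHost.show(false)
// Bob's host shows as unknown instead of null
```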
Related functions
For converting between strings and binary representations with a specific character set, see decode and encode. For hexadecimal string conversions, see hex and unhex. For Base64 encoding, see base64 and unbase64. For extracting substrings with regular expressions — useful when parse_url doesn't cover your exact case — see regexp_extract.