Spark Scala Decode and Encode
encode converts a string column to binary using a specified character set. decode does the reverse — it converts binary data back to a string. Together they let you move between string and binary representations, which is useful when working with systems that expect raw bytes or when you need to control the character encoding explicitly.
Encoding strings to binary
def encode(value: Column, charset: String): Column
encode takes a string column and a character set name (like "UTF-8") and returns a binary column containing the encoded bytes.
import org.apache.spark.sql.functions.{col, decode, encode}
import spark.implicits._ // for toDF, assuming a SparkSession named spark

val df = Seq(
  "San Francisco",
  "New York",
  "Chicago",
  "Austin",
  "Portland"
).toDF("city")
val df2 = df
.withColumn("city_binary", encode(col("city"), "UTF-8"))
df2.show(false)
// +-------------+----------------------------------------+
// |city |city_binary |
// +-------------+----------------------------------------+
// |San Francisco|[53 61 6E 20 46 72 61 6E 63 69 73 63 6F]|
// |New York |[4E 65 77 20 59 6F 72 6B] |
// |Chicago |[43 68 69 63 61 67 6F] |
// |Austin |[41 75 73 74 69 6E] |
// |Portland |[50 6F 72 74 6C 61 6E 64] |
// +-------------+----------------------------------------+
The binary column displays each byte in hex notation. The Spark documentation lists the supported character sets as US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16.
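Under the hood this is the JVM's standard charset machinery. Here is a minimal plain-Scala sketch (no Spark required) of what encode produces per row, using String.getBytes:

```scala
// Plain-JVM sketch of per-row encoding, assuming java.nio.charset
// semantics: getBytes returns the encoded bytes for the given charset.
val utf8Bytes  = "San Francisco".getBytes("UTF-8")
val utf16Bytes = "San Francisco".getBytes("UTF-16")

// UTF-8 uses one byte per ASCII character: 13 bytes.
println(utf8Bytes.length)  // 13
// Java's "UTF-16" encoder prepends a 2-byte BOM and uses 2 bytes
// per character here: 28 bytes.
println(utf16Bytes.length) // 28

// Rendering the bytes in hex matches Spark's binary display.
println(utf8Bytes.map("%02X".format(_)).mkString("[", " ", "]"))
// [53 61 6E 20 46 72 61 6E 63 69 73 63 6F]
```

The byte lengths show why the charset argument matters: the same string produces different binary values under different encodings.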
Decoding binary back to strings
def decode(value: Column, charset: String): Column
decode takes a binary column and a character set name and returns a string column. It's the inverse of encode — encoding a string and then decoding with the same charset gives back the original value.
val df = Seq(
  "San Francisco",
  "New York",
  "Chicago",
  "Austin",
  "Portland"
).toDF("city")
val df2 = df
.withColumn("city_binary", encode(col("city"), "UTF-8"))
.withColumn("city_decoded", decode(col("city_binary"), "UTF-8"))
df2.show(false)
// +-------------+----------------------------------------+-------------+
// |city |city_binary |city_decoded |
// +-------------+----------------------------------------+-------------+
// |San Francisco|[53 61 6E 20 46 72 61 6E 63 69 73 63 6F]|San Francisco|
// |New York |[4E 65 77 20 59 6F 72 6B] |New York |
// |Chicago |[43 68 69 63 61 67 6F] |Chicago |
// |Austin |[41 75 73 74 69 6E] |Austin |
// |Portland |[50 6F 72 74 6C 61 6E 64] |Portland |
// +-------------+----------------------------------------+-------------+
The city_decoded column matches the original city column exactly, confirming the round-trip works as expected.
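The round-trip can also be sketched with plain JVM APIs. One caveat worth noting: decoding with a different charset than the one used to encode does not fail, it silently mangles multi-byte characters:

```scala
// Round-trip sketch with plain JVM APIs: encoding a string and
// decoding the bytes with the same charset recovers the original.
val original = "São Paulo"                // ã is outside ASCII
val bytes    = original.getBytes("UTF-8") // ã encodes as two bytes: C3 A3
val back     = new String(bytes, "UTF-8")
println(back == original)                 // true

// Decoding with the wrong charset does not throw; it reinterprets
// each byte independently and mangles the multi-byte character.
val wrong = new String(bytes, "ISO-8859-1")
println(wrong)                            // SÃ£o Paulo
```

The same applies in Spark: decode with a mismatched charset produces garbage rather than an error, so the charset needs to travel with the data.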
Null handling
Both encode and decode return null when the input is null. This follows Spark's standard null propagation — no exception, no empty string.
val df = Seq(
  ("Alice", "alice@example.com"),
  ("Bob", "bob@example.com"),
  ("Carol", null),
  ("Dave", "dave@example.com")
).toDF("name", "email")
val df2 = df
.withColumn("email_binary", encode(col("email"), "UTF-8"))
.withColumn("email_decoded", decode(col("email_binary"), "UTF-8"))
df2.show(false)
// +-----+-----------------+----------------------------------------------------+-----------------+
// |name |email |email_binary |email_decoded |
// +-----+-----------------+----------------------------------------------------+-----------------+
// |Alice|alice@example.com|[61 6C 69 63 65 40 65 78 61 6D 70 6C 65 2E 63 6F 6D]|alice@example.com|
// |Bob |bob@example.com |[62 6F 62 40 65 78 61 6D 70 6C 65 2E 63 6F 6D] |bob@example.com |
// |Carol|null |null |null |
// |Dave |dave@example.com |[64 61 76 65 40 65 78 61 6D 70 6C 65 2E 63 6F 6D] |dave@example.com |
// +-----+-----------------+----------------------------------------------------+-----------------+
Carol's null email flows through as null for both encode and decode.
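This is worth contrasting with raw JVM behavior, where calling getBytes on a null reference throws a NullPointerException. A sketch of Spark's per-row semantics using Option (the helper name encodeNullable is illustrative, not a Spark API):

```scala
// Sketch of Spark's null propagation: wrap the nullable input in
// Option so null maps to None instead of throwing a NullPointerException.
def encodeNullable(s: String): Option[Array[Byte]] =
  Option(s).map(_.getBytes("UTF-8"))

println(encodeNullable("bob@example.com").map(_.length)) // Some(15)
println(encodeNullable(null))                            // None
```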
Related functions
For Base64 encoding and decoding, see base64 and unbase64. For hexadecimal conversions, see hex and unhex. For cryptographic hashing that produces hex output, see the hashing functions.
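As a quick illustration of the relationship, Spark's base64 applied to an encoded binary column should match what the JVM's own Base64 encoder produces from the raw bytes (a plain-Scala sketch, no Spark required):

```scala
import java.util.Base64

// Base64 is a text representation of arbitrary bytes, so it composes
// naturally with encode: first get the bytes, then Base64-encode them.
val bytes = "Chicago".getBytes("UTF-8")
val b64   = Base64.getEncoder.encodeToString(bytes)
println(b64) // Q2hpY2Fnbw==
```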