Spark Scala String Length Functions
Spark provides three functions for measuring string size: length counts characters, octet_length counts bytes, and bit_length counts bits. For ASCII text they all agree, but they diverge once you have Unicode characters — which matters whenever you're validating input lengths or working with encoded data.
def length(e: Column): Column
length returns the number of characters in a string. For ASCII strings one character equals one byte, but for multi-byte Unicode characters (accented letters, CJK characters, etc.) length gives you the human-visible character count, not the storage size.
Here's a basic example measuring city names:
import org.apache.spark.sql.functions.{col, length}
import spark.implicits._
val df = Seq(
"San Francisco",
"Los Angeles",
"New York",
"Chicago",
"Seattle",
).toDF("city")
val df2 = df
.withColumn("char_count", length(col("city")))
df2.show(false)
// +-------------+----------+
// |city |char_count|
// +-------------+----------+
// |San Francisco|13 |
// |Los Angeles |11 |
// |New York |8 |
// |Chicago |7 |
// |Seattle |7 |
// +-------------+----------+
Byte-level functions: bit_length and octet_length
The bit_length and octet_length functions both first appeared in version 3.3.0 and are defined as:
def bit_length(e: Column): Column
def octet_length(e: Column): Column
octet_length returns the number of bytes used to store the string (an octet is 8 bits, i.e. one byte). bit_length returns the same value multiplied by 8. For pure ASCII strings these match length, but for Unicode strings they'll be larger — Spark stores strings as UTF-8, where non-ASCII characters take 2–4 bytes each.
This example shows all three side by side on a mix of ASCII, accented, and CJK characters:
import org.apache.spark.sql.functions.{bit_length, col, length, octet_length}
import spark.implicits._
val df = Seq(
"hello",
"café",
"naïve",
"résumé",
"日本語",
).toDF("word")
val df2 = df
.withColumn("length", length(col("word")))
.withColumn("bit_length", bit_length(col("word")))
.withColumn("octet_length", octet_length(col("word")))
df2.show(false)
// +------+------+----------+------------+
// |word |length|bit_length|octet_length|
// +------+------+----------+------------+
// |hello |5 |40 |5 |
// |café |4 |40 |5 |
// |naïve |5 |48 |6 |
// |résumé|6 |64 |8 |
// |日本語|3 |72 |9 |
// +------+------+----------+------------+
A few things worth noting from the output:
- hello: 5 ASCII characters, 5 bytes, 40 bits. All three measures agree.
- café: 4 characters (c, a, f, é), but é takes 2 bytes in UTF-8, so octet_length is 5.
- 日本語: 3 CJK characters, each 3 bytes in UTF-8, giving an octet_length of 9.
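One practical use of the gap between the two measures: comparing length with octet_length flags strings that contain any non-ASCII characters. A minimal sketch (the DataFrame and column names here are illustrative):

```scala
import org.apache.spark.sql.functions.{col, length, octet_length}
import spark.implicits._

val words = Seq("hello", "café", "日本語").toDF("word")

// A string is pure ASCII exactly when its character count equals its byte count.
val flagged = words.withColumn(
  "has_non_ascii",
  length(col("word")) =!= octet_length(col("word"))
)
flagged.show(false)
// hello -> false, café -> true, 日本語 -> true
```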
Use length when you care about visible character count (e.g., validating a username is at most 20 characters). Use octet_length when you care about storage size or are working with byte-length limits (e.g., a database column with a byte-length constraint).
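As a sketch of both checks side by side — the 20-character and 64-byte limits are made-up constraints for illustration:

```scala
import org.apache.spark.sql.functions.{col, length, octet_length}
import spark.implicits._

val users = Seq("alice", "josé", "a-much-longer-username-than-anyone-needs").toDF("username")

val validated = users
  // human-visible character limit, e.g. UI validation
  .withColumn("fits_20_chars", length(col("username")) <= 20)
  // byte-length limit, e.g. a target column constrained to 64 bytes
  .withColumn("fits_64_bytes", octet_length(col("username")) <= 64)
validated.show(false)
```

Note that a string can pass the character check while failing the byte check: 20 CJK characters fit the first limit but occupy 60 bytes, close to the second.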
SQL Aliases: len, char_length, and character_length
Spark SQL provides len, char_length, and character_length as SQL-compatible aliases for length. Older Spark versions don't expose them as Scala API functions, but you can always reach them via expr():
import org.apache.spark.sql.functions.expr
import spark.implicits._
val df = Seq(
"San Francisco",
"Los Angeles",
"New York",
).toDF("city")
val df2 = df
.withColumn("len", expr("len(city)"))
.withColumn("char_length", expr("char_length(city)"))
.withColumn("character_length", expr("character_length(city)"))
df2.show(false)
// +-------------+---+-----------+----------------+
// |city |len|char_length|character_length|
// +-------------+---+-----------+----------------+
// |San Francisco|13 |13 |13 |
// |Los Angeles |11 |11 |11 |
// |New York |8 |8 |8 |
// +-------------+---+-----------+----------------+
All three produce the same result as length. They're there for SQL compatibility — useful if you're porting SQL queries into Spark or working with a team that has a SQL background.
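The same aliases can of course be used in a plain SQL query rather than through expr(); a sketch via a temporary view (the view name is illustrative, and len requires a reasonably recent Spark version):

```scala
import spark.implicits._

val cities = Seq("San Francisco", "New York").toDF("city")
cities.createOrReplaceTempView("cities")

// All three aliases resolve to the same character-count computation as length.
spark.sql(
  """SELECT city,
    |       len(city)              AS len,
    |       char_length(city)      AS char_length,
    |       character_length(city) AS character_length
    |FROM cities""".stripMargin
).show(false)
```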
For other string functions, see trim, ltrim, and rtrim for whitespace handling, lpad and rpad for padding strings to a fixed length, or substring for extracting characters by position.