Spark Scala String Length Functions
Spark provides three functions for measuring string size: length counts characters, octet_length counts bytes, and bit_length counts bits. For ASCII text they all agree, but they diverge once you have Unicode characters — which matters whenever you're validating input lengths or working with encoded data.
def length(e: Column): Column
length returns the number of characters in a string. For ASCII strings one character equals one byte, but for multi-byte Unicode characters (accented letters, CJK characters, etc.) length gives you the human-visible character count, not the storage size.
Here's a basic example measuring city names:
import org.apache.spark.sql.functions.{col, length}
import spark.implicits._
val df = Seq(
"San Francisco",
"Los Angeles",
"New York",
"Chicago",
"Seattle",
).toDF("city")
val df2 = df
.withColumn("char_count", length(col("city")))
df2.show(false)
// +-------------+----------+
// |city |char_count|
// +-------------+----------+
// |San Francisco|13 |
// |Los Angeles |11 |
// |New York |8 |
// |Chicago |7 |
// |Seattle |7 |
// +-------------+----------+
Byte-level functions: bit_length and octet_length
The bit_length and octet_length functions both first appeared in version 3.3.0 and are defined as:
def bit_length(e: Column): Column
def octet_length(e: Column): Column
octet_length returns the number of bytes used to store the string (an octet is 8 bits, i.e. one byte). bit_length returns the same value multiplied by 8. For pure ASCII strings these match length, but for Unicode strings they'll be larger — Spark stores strings as UTF-8, where non-ASCII characters take 2–4 bytes each.
This example shows all three side by side on a mix of ASCII, accented, and CJK characters:
import org.apache.spark.sql.functions.{bit_length, col, length, octet_length}
import spark.implicits._
val df = Seq(
"hello",
"café",
"naïve",
"résumé",
"日本語",
).toDF("word")
val df2 = df
.withColumn("length", length(col("word")))
.withColumn("bit_length", bit_length(col("word")))
.withColumn("octet_length", octet_length(col("word")))
df2.show(false)
// +------+------+----------+------------+
// |word |length|bit_length|octet_length|
// +------+------+----------+------------+
// |hello |5 |40 |5 |
// |café |4 |40 |5 |
// |naïve |5 |48 |6 |
// |résumé|6 |64 |8 |
// |日本語|3 |72 |9 |
// +------+------+----------+------------+
A few things worth noting from the output:
- hello: 5 ASCII characters, 5 bytes, 40 bits. All three measures agree.
- café: 4 characters (c, a, f, é), but é takes 2 bytes in UTF-8, so octet_length is 5.
- 日本語: 3 CJK characters, each 3 bytes in UTF-8, giving an octet_length of 9.
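One practical use of the gap between the two measures: comparing length with octet_length flags strings that contain any non-ASCII characters. A minimal sketch (the DataFrame and column names here are illustrative):

```scala
import org.apache.spark.sql.functions.{col, length, octet_length}
import spark.implicits._

val words = Seq("hello", "café", "日本語").toDF("word")

// A string is pure ASCII exactly when its character count equals its byte count.
val flagged = words.withColumn(
  "has_non_ascii",
  length(col("word")) =!= octet_length(col("word"))
)
flagged.show(false)
// hello -> false, café -> true, 日本語 -> true
```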
Use length when you care about visible character count (e.g., validating a username is at most 20 characters). Use octet_length when you care about storage size or are working with byte-length limits (e.g., a database column with a byte-length constraint).
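As a sketch of both checks side by side — the 20-character and 64-byte limits are made-up constraints for illustration:

```scala
import org.apache.spark.sql.functions.{col, length, octet_length}
import spark.implicits._

val users = Seq("alice", "josé", "a-much-longer-username-than-anyone-needs").toDF("username")

val validated = users
  // human-visible character limit, e.g. UI validation
  .withColumn("fits_20_chars", length(col("username")) <= 20)
  // byte-length limit, e.g. a target column constrained to 64 bytes
  .withColumn("fits_64_bytes", octet_length(col("username")) <= 64)
validated.show(false)
```

Note that a string can pass the character check while failing the byte check: 20 CJK characters fit the first limit but occupy 60 bytes, close to the second.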
SQL Aliases: len, char_length, and character_length
Spark SQL provides len, char_length, and character_length as SQL-compatible aliases for length. Older Spark versions don't expose them as Scala API functions, but you can always reach them via expr():
import org.apache.spark.sql.functions.expr
import spark.implicits._
val df = Seq(
"San Francisco",
"Los Angeles",
"New York",
).toDF("city")
val df2 = df
.withColumn("len", expr("len(city)"))
.withColumn("char_length", expr("char_length(city)"))
.withColumn("character_length", expr("character_length(city)"))
df2.show(false)
// +-------------+---+-----------+----------------+
// |city |len|char_length|character_length|
// +-------------+---+-----------+----------------+
// |San Francisco|13 |13 |13 |
// |Los Angeles |11 |11 |11 |
// |New York |8 |8 |8 |
// +-------------+---+-----------+----------------+
All three produce the same result as length. They're there for SQL compatibility — useful if you're porting SQL queries into Spark or working with a team that has a SQL background.
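The same aliases can of course be used in a plain SQL query rather than through expr(); a sketch via a temporary view (the view name is illustrative, and len requires a reasonably recent Spark version):

```scala
import spark.implicits._

val cities = Seq("San Francisco", "New York").toDF("city")
cities.createOrReplaceTempView("cities")

// All three aliases resolve to the same character-count computation as length.
spark.sql(
  """SELECT city,
    |       len(city)              AS len,
    |       char_length(city)      AS char_length,
    |       character_length(city) AS character_length
    |FROM cities""".stripMargin
).show(false)
```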
For other string functions, see trim, ltrim, and rtrim for whitespace handling, lpad and rpad for padding strings to a fixed length, or substring for extracting characters by position.