Spark Scala Substring Functions
substring and substring_index are two complementary ways to extract a portion of a string column. Use substring when you know the character position; use substring_index when you want to split on a delimiter.
substring
def substring(str: Column, pos: Int, len: Int): Column
substring returns len characters starting at position pos. Positions are 1-based — the first character is at position 1. If pos is negative, it counts back from the end of the string.
Here's a practical example extracting parts of a formatted phone number:
val df = Seq(
("Alice Chen", "555-867-5309"),
("Bob Martinez", "555-234-5678"),
("Diana Okafor", "555-987-6543"),
("Evan Patel", "555-111-2222"),
).toDF("name", "phone")
val df2 = df
.withColumn("area_code", substring(col("phone"), 1, 3))
.withColumn("exchange", substring(col("phone"), 5, 3))
.withColumn("last_four", substring(col("phone"), 9, 4))
df2.show(false)
// +------------+------------+---------+--------+---------+
// |name |phone |area_code|exchange|last_four|
// +------------+------------+---------+--------+---------+
// |Alice Chen |555-867-5309|555 |867 |5309 |
// |Bob Martinez|555-234-5678|555 |234 |5678 |
// |Diana Okafor|555-987-6543|555 |987 |6543 |
// |Evan Patel |555-111-2222|555 |111 |2222 |
// +------------+------------+---------+--------+---------+
Negative positions count from the end. A pos of -4 starts 4 characters from the right. This is useful when your data has a fixed-length suffix at a known distance from the end — like a year embedded at the tail of a structured code:
val df = Seq(
"PROD-ABC-2024",
"PROD-XYZ-2023",
"PROD-DEF-2022",
"PROD-GHI-2021",
).toDF("product_code")
val df2 = df
.withColumn("category", substring(col("product_code"), 6, 3))
.withColumn("year", substring(col("product_code"), -4, 4))
df2.show(false)
// +-------------+--------+----+
// |product_code |category|year|
// +-------------+--------+----+
// |PROD-ABC-2024|ABC |2024|
// |PROD-XYZ-2023|XYZ |2023|
// |PROD-DEF-2022|DEF |2022|
// |PROD-GHI-2021|GHI |2021|
// +-------------+--------+----+
When substring encounters a null value, the result is null.
substring_index
def substring_index(str: Column, delim: String, count: Int): Column
substring_index splits a string on a delimiter and returns everything to the left of the Nth occurrence. When count is positive, it counts occurrences from the left. When count is negative, it counts from the right — returning everything to the right of the Nth occurrence from the end.
IP addresses are a natural fit for this function since the octets are .-delimited:
val df = Seq(
"192.168.1.100",
"10.0.0.1",
"172.16.254.1",
"203.0.113.42",
).toDF("ip_address")
val df2 = df
.withColumn("first_octet", substring_index(col("ip_address"), ".", 1))
.withColumn("first_two", substring_index(col("ip_address"), ".", 2))
.withColumn("last_octet", substring_index(col("ip_address"), ".", -1))
df2.show(false)
// +-------------+-----------+---------+----------+
// |ip_address |first_octet|first_two|last_octet|
// +-------------+-----------+---------+----------+
// |192.168.1.100|192 |192.168 |100 |
// |10.0.0.1 |10 |10.0 |1 |
// |172.16.254.1 |172 |172.16 |1 |
// |203.0.113.42 |203 |203.0 |42 |
// +-------------+-----------+---------+----------+
first_octet uses count=1 — everything before the first .. first_two uses count=2 — everything before the second .. last_octet uses count=-1 — everything after the last ..
The delimiter can be any string, not just a single character. If the delimiter isn't found in the string, the entire string is returned.
Related functions
For more string manipulation options, see trim, ltrim, and rtrim for whitespace handling, lpad and rpad for fixed-width formatting, or regexp_replace for pattern-based substitution.