Job Board
Consulting

Spark Scala Substring Functions

substring and substring_index are two complementary ways to extract a portion of a string column. Use substring when you know the character position; use substring_index when you want to split on a delimiter.

substring

def substring(str: Column, pos: Int, len: Int): Column

substring returns len characters starting at position pos. Positions are 1-based — the first character is at position 1. If pos is negative, it counts back from the end of the string.

Here's a practical example extracting parts of a formatted phone number:

val df = Seq(
  ("Alice Chen",    "555-867-5309"),
  ("Bob Martinez",  "555-234-5678"),
  ("Diana Okafor",  "555-987-6543"),
  ("Evan Patel",    "555-111-2222"),
).toDF("name", "phone")

val df2 = df
  .withColumn("area_code", substring(col("phone"), 1, 3))
  .withColumn("exchange",  substring(col("phone"), 5, 3))
  .withColumn("last_four", substring(col("phone"), 9, 4))

df2.show(false)
// +------------+------------+---------+--------+---------+
// |name        |phone       |area_code|exchange|last_four|
// +------------+------------+---------+--------+---------+
// |Alice Chen  |555-867-5309|555      |867     |5309     |
// |Bob Martinez|555-234-5678|555      |234     |5678     |
// |Diana Okafor|555-987-6543|555      |987     |6543     |
// |Evan Patel  |555-111-2222|555      |111     |2222     |
// +------------+------------+---------+--------+---------+

Negative positions count from the end. A pos of -4 starts 4 characters from the right. This is useful when your data has a fixed-length suffix at a known distance from the end — like a year embedded at the tail of a structured code:

val df = Seq(
  "PROD-ABC-2024",
  "PROD-XYZ-2023",
  "PROD-DEF-2022",
  "PROD-GHI-2021",
).toDF("product_code")

val df2 = df
  .withColumn("category", substring(col("product_code"), 6, 3))
  .withColumn("year",     substring(col("product_code"), -4, 4))

df2.show(false)
// +-------------+--------+----+
// |product_code |category|year|
// +-------------+--------+----+
// |PROD-ABC-2024|ABC     |2024|
// |PROD-XYZ-2023|XYZ     |2023|
// |PROD-DEF-2022|DEF     |2022|
// |PROD-GHI-2021|GHI     |2021|
// +-------------+--------+----+

When substring encounters a null value, the result is null.

substring_index

def substring_index(str: Column, delim: String, count: Int): Column

substring_index splits a string on a delimiter and returns everything to the left of the Nth occurrence. When count is positive, it counts occurrences from the left. When count is negative, it counts from the right — returning everything to the right of the Nth occurrence from the end.

IP addresses are a natural fit for this function since the octets are .-delimited:

val df = Seq(
  "192.168.1.100",
  "10.0.0.1",
  "172.16.254.1",
  "203.0.113.42",
).toDF("ip_address")

val df2 = df
  .withColumn("first_octet",  substring_index(col("ip_address"), ".", 1))
  .withColumn("first_two",    substring_index(col("ip_address"), ".", 2))
  .withColumn("last_octet",   substring_index(col("ip_address"), ".", -1))

df2.show(false)
// +-------------+-----------+---------+----------+
// |ip_address   |first_octet|first_two|last_octet|
// +-------------+-----------+---------+----------+
// |192.168.1.100|192        |192.168  |100       |
// |10.0.0.1     |10         |10.0     |1         |
// |172.16.254.1 |172        |172.16   |1         |
// |203.0.113.42 |203        |203.0    |42        |
// +-------------+-----------+---------+----------+

first_octet uses count=1 — everything before the first .. first_two uses count=2 — everything before the second .. last_octet uses count=-1 — everything after the last ..

The delimiter can be any string, not just a single character. If the delimiter isn't found in the string, the entire string is returned.

For more string manipulation options, see instr and locate to find the position of a substring, split to break a string into an array on a delimiter, trim, ltrim, and rtrim for whitespace handling, lpad and rpad for fixed-width formatting, or regexp_replace for pattern-based substitution.

Example Details

Created: 2026-03-17 02:40:44 PM

Last Updated: 2026-03-17 02:40:44 PM