Job Board
Consulting

Spark Scala Substring Functions

substring and substring_index are two complementary ways to extract a portion of a string column. Use substring when you know the character position; use substring_index when you want to split on a delimiter.

substring

def substring(str: Column, pos: Int, len: Int): Column

substring returns len characters starting at position pos. Positions are 1-based — the first character is at position 1. If pos is negative, it counts back from the end of the string.

Here's a practical example extracting parts of a formatted phone number:

val df = Seq(
  ("Alice Chen",    "555-867-5309"),
  ("Bob Martinez",  "555-234-5678"),
  ("Diana Okafor",  "555-987-6543"),
  ("Evan Patel",    "555-111-2222"),
).toDF("name", "phone")

val df2 = df
  .withColumn("area_code", substring(col("phone"), 1, 3))
  .withColumn("exchange",  substring(col("phone"), 5, 3))
  .withColumn("last_four", substring(col("phone"), 9, 4))

df2.show(false)
// +------------+------------+---------+--------+---------+
// |name        |phone       |area_code|exchange|last_four|
// +------------+------------+---------+--------+---------+
// |Alice Chen  |555-867-5309|555      |867     |5309     |
// |Bob Martinez|555-234-5678|555      |234     |5678     |
// |Diana Okafor|555-987-6543|555      |987     |6543     |
// |Evan Patel  |555-111-2222|555      |111     |2222     |
// +------------+------------+---------+--------+---------+

Negative positions count from the end. A pos of -4 starts 4 characters from the right. This is useful when your data has a fixed-length suffix at a known distance from the end — like a year embedded at the tail of a structured code:

val df = Seq(
  "PROD-ABC-2024",
  "PROD-XYZ-2023",
  "PROD-DEF-2022",
  "PROD-GHI-2021",
).toDF("product_code")

val df2 = df
  .withColumn("category", substring(col("product_code"), 6, 3))
  .withColumn("year",     substring(col("product_code"), -4, 4))

df2.show(false)
// +-------------+--------+----+
// |product_code |category|year|
// +-------------+--------+----+
// |PROD-ABC-2024|ABC     |2024|
// |PROD-XYZ-2023|XYZ     |2023|
// |PROD-DEF-2022|DEF     |2022|
// |PROD-GHI-2021|GHI     |2021|
// +-------------+--------+----+

When substring encounters a null value, the result is null.

substring_index

def substring_index(str: Column, delim: String, count: Int): Column

substring_index splits a string on a delimiter and returns everything to the left of the Nth occurrence. When count is positive, it counts occurrences from the left. When count is negative, it counts from the right — returning everything to the right of the Nth occurrence from the end.

IP addresses are a natural fit for this function since the octets are .-delimited:

val df = Seq(
  "192.168.1.100",
  "10.0.0.1",
  "172.16.254.1",
  "203.0.113.42",
).toDF("ip_address")

val df2 = df
  .withColumn("first_octet",  substring_index(col("ip_address"), ".", 1))
  .withColumn("first_two",    substring_index(col("ip_address"), ".", 2))
  .withColumn("last_octet",   substring_index(col("ip_address"), ".", -1))

df2.show(false)
// +-------------+-----------+---------+----------+
// |ip_address   |first_octet|first_two|last_octet|
// +-------------+-----------+---------+----------+
// |192.168.1.100|192        |192.168  |100       |
// |10.0.0.1     |10         |10.0     |1         |
// |172.16.254.1 |172        |172.16   |1         |
// |203.0.113.42 |203        |203.0    |42        |
// +-------------+-----------+---------+----------+

first_octet uses count=1 — everything before the first .. first_two uses count=2 — everything before the second .. last_octet uses count=-1 — everything after the last ..

The delimiter can be any string, not just a single character. If the delimiter isn't found in the string, the entire string is returned.

For more string manipulation options, see trim, ltrim, and rtrim for whitespace handling, lpad and rpad for fixed-width formatting, or regexp_replace for pattern-based substitution.

Example Details

Created: 2026-03-17 02:40:44 PM

Last Updated: 2026-03-17 02:40:44 PM