Job Board
Consulting

Spark Scala instr and locate

instr and locate both find the position of a substring within a string column. They return the same result — the difference is just argument order and the fact that locate has an optional start-position parameter for finding occurrences beyond the first.

instr

def instr(str: Column, substring: String): Column

instr returns the 1-based character position of the first occurrence of substring in str. If the substring is not found, it returns 0. Null input returns null.

Here's an example that finds the position of @ in email addresses and the position of # in order status strings:

val df = Seq(
  ("alice@example.com",    "order #A1042 placed"),
  ("bob.smith@company.org", "order #B2201 shipped"),
  ("carol@mail.net",       "order #C3309 delivered"),
  ("dave@example.com",     "no order found"),
).toDF("email", "status")

val df2 = df
  .withColumn("at_pos",    instr(col("email"), "@"))
  .withColumn("order_pos", instr(col("status"), "#"))

df2.show(false)
// +---------------------+----------------------+------+---------+
// |email                |status                |at_pos|order_pos|
// +---------------------+----------------------+------+---------+
// |alice@example.com    |order #A1042 placed   |6     |7        |
// |bob.smith@company.org|order #B2201 shipped  |10    |7        |
// |carol@mail.net       |order #C3309 delivered|6     |7        |
// |dave@example.com     |no order found        |5     |0        |
// +---------------------+----------------------+------+---------+

at_pos tells you where the @ sits — useful for validation or splitting the email into username and domain. order_pos returns 0 for the last row because # doesn't appear in "no order found".

locate

def locate(substr: String, str: Column): Column

def locate(substr: String, str: Column, pos: Int): Column

locate does the same thing as instr but with the arguments reversed — the search string comes first, then the column. It also has a second signature that takes a pos parameter, which sets the character position to start searching from.

The pos parameter makes it possible to find the second (or third, etc.) occurrence of a substring. Here's an example using ISO date strings, where both hyphens are always at known positions:

val df = Seq(
  "2024-01-15",
  "2024-03-22",
  "2024-11-07",
  "2025-06-30",
).toDF("date_str")

val df2 = df
  .withColumn("first_dash",  locate("-", col("date_str")))
  .withColumn("second_dash", locate("-", col("date_str"), 6))

df2.show(false)
// +----------+----------+-----------+
// |date_str  |first_dash|second_dash|
// +----------+----------+-----------+
// |2024-01-15|5         |8          |
// |2024-03-22|5         |8          |
// |2024-11-07|5         |8          |
// |2025-06-30|5         |8          |
// +----------+----------+-----------+

first_dash finds the first - at position 5 (after 2024). second_dash starts searching from position 6, skipping the first hyphen, and finds the second - at position 8 (after 2024-01). Passing pos=1 is equivalent to the two-argument form.

position (SQL alias)

position is a SQL alias for locate. It's not available as a Scala API function, but you can use it via expr():

val df = Seq(
  "product:laptop:electronics",
  "product:headphones:electronics",
  "product:notebook:stationery",
  "product:mug:kitchen",
).toDF("sku")

val df2 = df
  .withColumn("instr_pos",    instr(col("sku"), ":"))
  .withColumn("locate_pos",   locate(":", col("sku")))
  .withColumn("position_pos", expr("position(':' IN sku)"))

df2.show(false)
// +------------------------------+---------+----------+------------+
// |sku                           |instr_pos|locate_pos|position_pos|
// +------------------------------+---------+----------+------------+
// |product:laptop:electronics    |8        |8         |8           |
// |product:headphones:electronics|8        |8         |8           |
// |product:notebook:stationery   |8        |8         |8           |
// |product:mug:kitchen           |8        |8         |8           |
// +------------------------------+---------+----------+------------+

All three produce identical results. position uses SQL's POSITION(substr IN str) syntax rather than a function-call style, which is why it needs expr(). Use whichever reads most naturally for your codebase — instr if you prefer the column argument first, locate if you need the start-position parameter.

For extracting characters once you have a position, see substring for position-based extraction. For pattern-based searching rather than literal substrings, see regexp_replace. For splitting a string into parts by a delimiter, see split.

Example Details

Created: 2026-03-21 01:58:58 PM

Last Updated: 2026-03-21 01:58:58 PM