Spark Scala find_in_set
find_in_set returns the 1-based position of a string within a comma-delimited list stored in another column. It returns 0 if the string isn't found and null if either input is null.
find_in_set
find_in_set(str, strArray) - Returns the index (1-based) of str in the comma-delimited strArray
find_in_set is a SQL function — use it via expr() in Spark Scala. It takes two arguments: the string to search for and a column containing comma-separated values. It returns an integer: the 1-based position if found, or 0 if not.
Here's an example that checks which position each role occupies in a comma-delimited roles column:
val df = Seq(
("Alice", "admin,editor,viewer"),
("Bob", "editor,viewer"),
("Carol", "admin,viewer"),
("Dave", "viewer"),
("Eve", "admin,editor"),
).toDF("name", "roles")
val df2 = df
.withColumn("admin_pos", expr("find_in_set('admin', roles)"))
.withColumn("editor_pos", expr("find_in_set('editor', roles)"))
.withColumn("viewer_pos", expr("find_in_set('viewer', roles)"))
df2.show(false)
// +-----+-------------------+---------+----------+----------+
// |name |roles |admin_pos|editor_pos|viewer_pos|
// +-----+-------------------+---------+----------+----------+
// |Alice|admin,editor,viewer|1 |2 |3 |
// |Bob |editor,viewer |0 |1 |2 |
// |Carol|admin,viewer |1 |0 |2 |
// |Dave |viewer |0 |0 |1 |
// |Eve |admin,editor |1 |2 |0 |
// +-----+-------------------+---------+----------+----------+
Alice has all three roles — admin at position 1, editor at 2, viewer at 3. Bob has no admin role, so admin_pos returns 0. The return value is a position, not a boolean, but you can use it as one: any non-zero value means the string was found.
To filter rows where a specific value exists in the list, check for a non-zero result:
scala
df.filter(expr("find_in_set('admin', roles)") > 0)
Edge cases: commas, empty strings, and nulls
find_in_set has specific behavior worth knowing about:
- If the search string contains a comma, the result is always
0— it's treated as not found because commas are the delimiter. - An empty string can match an empty element in the list (two consecutive commas create an empty element).
- Null in either argument returns
null.
val df = Seq(
("admin", "admin,editor,viewer"),
("manager", "admin,editor,viewer"),
("admin,editor", "admin,editor,viewer"),
("", "admin,,viewer"),
(null, "admin,editor,viewer"),
).toDF("search", "roles")
val df2 = df
.withColumn("result", expr("find_in_set(search, roles)"))
df2.show(false)
// +------------+-------------------+------+
// |search |roles |result|
// +------------+-------------------+------+
// |admin |admin,editor,viewer|1 |
// |manager |admin,editor,viewer|0 |
// |admin,editor|admin,editor,viewer|0 |
// | |admin,,viewer |2 |
// |null |admin,editor,viewer|null |
// +------------+-------------------+------+
"admin,editor" returns 0 because the search string itself contains a comma. The empty string "" matches the empty element between the two consecutive commas in "admin,,viewer" at position 2. The null row returns null as expected.
Related functions
For checking whether a value exists in a fixed list of literals (not a comma-delimited string column), see isin. For splitting a comma-delimited string into an array column, see split. For finding the position of a substring within a string (rather than a value in a delimited list), see instr and locate.