Spark Scala split Function
split breaks a string column on a delimiter or regular expression pattern and returns an ArrayType column. It's the go-to function whenever you need to turn a delimited string — like a CSV field, a tag list, or a log line — into individual elements you can work with.
split
def split(str: Column, pattern: String): Column
The two-argument form splits str on every occurrence of pattern and returns all parts as an array. pattern is a Java regular expression, so special characters like ., |, *, and + must be escaped with \\ if you want them treated as literals.
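For instance, splitting on a literal pipe requires escaping, because | is the regex alternation operator. A minimal sketch (the column name is illustrative):

```scala
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._

val letters = Seq("a|b|c").toDF("letters")

// "\\|" matches a literal pipe; an unescaped "|" would match the
// empty string and split between every single character instead.
letters.withColumn("parts", split(col("letters"), "\\|")).show(false)
// +-------+---------+
// |letters|parts    |
// +-------+---------+
// |a|b|c  |[a, b, c]|
// +-------+---------+
```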
Here, a comma-delimited skills column is split into an array:
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._

val df = Seq(
("Alice Chen", "scala,spark,functional-programming"),
("Bob Martinez", "java,spring,microservices"),
("Diana Okafor", "python,pandas,machine-learning"),
("Evan Patel", "go,kubernetes,devops"),
).toDF("name", "skills")
val df2 = df
.withColumn("skill_array", split(col("skills"), ","))
df2.show(false)
// +------------+----------------------------------+--------------------------------------+
// |name |skills |skill_array |
// +------------+----------------------------------+--------------------------------------+
// |Alice Chen |scala,spark,functional-programming|[scala, spark, functional-programming]|
// |Bob Martinez|java,spring,microservices |[java, spring, microservices] |
// |Diana Okafor|python,pandas,machine-learning |[python, pandas, machine-learning] |
// |Evan Patel |go,kubernetes,devops |[go, kubernetes, devops] |
// +------------+----------------------------------+--------------------------------------+
Once skill_array is an array column, you can pass it to collection functions like array_contains or use .getItem(n) to pull out an element by index (0-based).
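A quick sketch of both, assuming the df2 built in the example above:

```scala
import org.apache.spark.sql.functions.{array_contains, col}

df2
  .withColumn("primary_skill", col("skill_array").getItem(0))
  .withColumn("knows_spark", array_contains(col("skill_array"), "spark"))
  .select("name", "primary_skill", "knows_spark")
  .show(false)
// +------------+-------------+-----------+
// |name        |primary_skill|knows_spark|
// +------------+-------------+-----------+
// |Alice Chen  |scala        |true       |
// |Bob Martinez|java         |false      |
// |Diana Okafor|python       |false      |
// |Evan Patel  |go           |false      |
// +------------+-------------+-----------+
```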
If the pattern is not found anywhere in the string, split returns a single-element array containing the original string unchanged. If str is null, the result is null.
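A small sketch of both edge cases (the data here is made up for illustration):

```scala
import org.apache.spark.sql.functions.{col, split}
import spark.implicits._

val edge = Seq("no-delimiters-here", null: String).toDF("raw")

edge.withColumn("parts", split(col("raw"), ",")).show(false)
// +------------------+--------------------+
// |raw               |parts               |
// +------------------+--------------------+
// |no-delimiters-here|[no-delimiters-here]|
// |null              |null                |
// +------------------+--------------------+
```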
split with a limit
def split(str: Column, pattern: String, limit: Int): Column
The three-argument form, available since Spark 3.0.0, adds a limit parameter that controls how many times the pattern is applied:
limit > 0: The array has at most limit elements. The last element contains the whole remainder of the string, everything that would have been split further had the limit not been reached.
limit <= 0: The pattern is applied as many times as possible, which is the same as the two-argument form.
With limit=2, only the first comma is used as a split point:
val df = Seq(
("Alice Chen", "scala,spark,functional-programming"),
("Bob Martinez", "java,spring,microservices"),
("Diana Okafor", "python,pandas,machine-learning"),
("Evan Patel", "go,kubernetes,devops"),
).toDF("name", "skills")
val df2 = df
.withColumn("top_two", split(col("skills"), ",", 2))
df2.show(false)
// +------------+----------------------------------+-------------------------------------+
// |name |skills |top_two |
// +------------+----------------------------------+-------------------------------------+
// |Alice Chen |scala,spark,functional-programming|[scala, spark,functional-programming]|
// |Bob Martinez|java,spring,microservices |[java, spring,microservices] |
// |Diana Okafor|python,pandas,machine-learning |[python, pandas,machine-learning] |
// |Evan Patel |go,kubernetes,devops |[go, kubernetes,devops] |
// +------------+----------------------------------+-------------------------------------+
The first element is the part before the first comma (scala). The second element is everything that follows (spark,functional-programming) — the unsplit remainder.
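As a sanity check, a non-positive limit behaves exactly like the two-argument form. A sketch against the same skills DataFrame as above:

```scala
import org.apache.spark.sql.functions.{col, split}

df.withColumn("all_parts", split(col("skills"), ",", -1))
  .select("name", "all_parts")
  .show(false)
// +------------+--------------------------------------+
// |name        |all_parts                             |
// +------------+--------------------------------------+
// |Alice Chen  |[scala, spark, functional-programming]|
// |Bob Martinez|[java, spring, microservices]         |
// |Diana Okafor|[python, pandas, machine-learning]    |
// |Evan Patel  |[go, kubernetes, devops]              |
// +------------+--------------------------------------+
```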
The limit is especially useful for structured text where you only care about a fixed number of leading fields, like parsing log lines. Using \\s+ (one or more whitespace characters) as the pattern and limit=3 splits each line into timestamp, level, and the full message body:
val df = Seq(
"2024-03-15T09:42:00Z INFO User alice logged in",
"2024-03-15T09:43:12Z WARN Disk usage at 85%",
"2024-03-15T09:44:55Z ERROR Database connection lost",
"2024-03-15T09:45:01Z INFO Reconnect attempt 1",
).toDF("log_line")
val df2 = df
.withColumn("parts", split(col("log_line"), "\\s+", 3))
.withColumn("timestamp", col("parts").getItem(0))
.withColumn("level", col("parts").getItem(1))
.withColumn("message", col("parts").getItem(2))
df2.show(false)
// +---------------------------------------------------+-------------------------------------------------------+--------------------+-----+------------------------+
// |log_line |parts |timestamp |level|message |
// +---------------------------------------------------+-------------------------------------------------------+--------------------+-----+------------------------+
// |2024-03-15T09:42:00Z INFO User alice logged in |[2024-03-15T09:42:00Z, INFO, User alice logged in] |2024-03-15T09:42:00Z|INFO |User alice logged in |
// |2024-03-15T09:43:12Z WARN Disk usage at 85% |[2024-03-15T09:43:12Z, WARN, Disk usage at 85%] |2024-03-15T09:43:12Z|WARN |Disk usage at 85% |
// |2024-03-15T09:44:55Z ERROR Database connection lost|[2024-03-15T09:44:55Z, ERROR, Database connection lost]|2024-03-15T09:44:55Z|ERROR|Database connection lost|
// |2024-03-15T09:45:01Z INFO Reconnect attempt 1 |[2024-03-15T09:45:01Z, INFO, Reconnect attempt 1] |2024-03-15T09:45:01Z|INFO |Reconnect attempt 1 |
// +---------------------------------------------------+-------------------------------------------------------+--------------------+-----+------------------------+
\\s+ matches one or more whitespace characters as a single separator, so it also tolerates runs of extra spaces between fields. With limit=3, the third element captures everything from the third token onward as one string, so the full message is preserved.
Related functions
For delimiter-based extraction that doesn't return an array, see substring_index — it returns the part of a string before or after the Nth occurrence of a delimiter. For pattern-based string replacement, see regexp_replace.
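As a quick sketch of the difference, substring_index returns a plain string rather than an array. A positive count keeps everything before the Nth delimiter; a negative count keeps everything after it, counting from the right:

```scala
import org.apache.spark.sql.functions.{col, substring_index}
import spark.implicits._

val skills = Seq("scala,spark,functional-programming").toDF("skills")

skills
  .withColumn("first_two", substring_index(col("skills"), ",", 2))
  .withColumn("last_one", substring_index(col("skills"), ",", -1))
  .show(false)
// +----------------------------------+-----------+----------------------+
// |skills                            |first_two  |last_one              |
// +----------------------------------+-----------+----------------------+
// |scala,spark,functional-programming|scala,spark|functional-programming|
// +----------------------------------+-----------+----------------------+
```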