Spark Scala split Function

split breaks a string column on a delimiter or regular expression pattern and returns an ArrayType column. It's the go-to function whenever you need to turn a delimited string — like a CSV field, a tag list, or a log line — into individual elements you can work with.

split

def split(str: Column, pattern: String): Column

The two-argument form splits str on every occurrence of pattern and returns all parts as an array. pattern is a Java regular expression, so special characters like ., |, *, and + must be escaped with \\ if you want them treated as literals.
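As a minimal sketch of the escaping rule, the snippet below splits on a literal pipe and a literal dot. The sample values, the column names, and the local SparkSession are assumptions made for illustration. Pattern.quote from java.util.regex is shown as an alternative to hand-written escapes.

```scala
import java.util.regex.Pattern
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Assumption: a local session for experimentation; in spark-shell this reuses the existing one
val spark = SparkSession.builder().master("local[*]").appName("split-escaping").getOrCreate()
import spark.implicits._

// Hypothetical sample data: a pipe-delimited tag list and a dotted version string
val raw = Seq(("etl|batch|nightly", "3.5.1")).toDF("tags", "version")

val escaped = raw
  // "\\|" is the regex \| which matches a literal pipe
  .withColumn("tag_array", split(col("tags"), "\\|"))
  // Pattern.quote wraps the delimiter in \Q...\E so it is matched literally
  .withColumn("tag_array_quoted", split(col("tags"), Pattern.quote("|")))
  // "\\." matches a literal dot; an unescaped "." would match any character
  .withColumn("version_parts", split(col("version"), "\\."))

escaped.show(false)
// tag_array and tag_array_quoted: [etl, batch, nightly]
// version_parts:                  [3, 5, 1]
```

Pattern.quote can be the safer choice when the delimiter comes from configuration or user input rather than a hard-coded literal.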

Here, a comma-delimited skills column is split into an array:

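import org.apache.spark.sql.functions.{col, split}
import spark.implicits._  // assumes a SparkSession named spark, as in spark-shell

val df = Seq(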
  ("Alice Chen",   "scala,spark,functional-programming"),
  ("Bob Martinez", "java,spring,microservices"),
  ("Diana Okafor", "python,pandas,machine-learning"),
  ("Evan Patel",   "go,kubernetes,devops"),
).toDF("name", "skills")

val df2 = df
  .withColumn("skill_array", split(col("skills"), ","))

df2.show(false)
// +------------+----------------------------------+--------------------------------------+
// |name        |skills                            |skill_array                           |
// +------------+----------------------------------+--------------------------------------+
// |Alice Chen  |scala,spark,functional-programming|[scala, spark, functional-programming]|
// |Bob Martinez|java,spring,microservices         |[java, spring, microservices]         |
// |Diana Okafor|python,pandas,machine-learning    |[python, pandas, machine-learning]    |
// |Evan Patel  |go,kubernetes,devops              |[go, kubernetes, devops]              |
// +------------+----------------------------------+--------------------------------------+

Once skill_array is an array column, you can pass it to collection functions like array_contains or use .getItem(n) to pull out an element by index (0-based).
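A short sketch of both follow-up operations, assuming a local SparkSession and a two-row version of the sample data. The derived column names knows_spark and first_skill are made up for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, col, split}

// Assumption: a local session for experimentation; in spark-shell this reuses the existing one
val spark = SparkSession.builder().master("local[*]").appName("split-array-usage").getOrCreate()
import spark.implicits._

val people = Seq(
  ("Alice Chen",   "scala,spark,functional-programming"),
  ("Bob Martinez", "java,spring,microservices")
).toDF("name", "skills")

val enriched = people
  .withColumn("skill_array", split(col("skills"), ","))
  // true when the array holds an element equal to "spark"
  .withColumn("knows_spark", array_contains(col("skill_array"), "spark"))
  // element at index 0, the part before the first comma
  .withColumn("first_skill", col("skill_array").getItem(0))

enriched.select("name", "knows_spark", "first_skill").show(false)
// Alice Chen   -> knows_spark = true,  first_skill = scala
// Bob Martinez -> knows_spark = false, first_skill = java
```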

If the pattern is not found anywhere in the string, split returns a single-element array containing the original string unchanged. If str is null, the result is null.
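Both edge cases can be checked with a small sketch like the one below. The label column and the three sample rows are hypothetical, and Option is used only as a convenient way to get a null into the DataFrame.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Assumption: a local session for experimentation; in spark-shell this reuses the existing one
val spark = SparkSession.builder().master("local[*]").appName("split-edge-cases").getOrCreate()
import spark.implicits._

// None becomes a null value in the skills column
val samples = Seq[(String, Option[String])](
  ("with_delimiter", Some("scala,spark")),
  ("no_delimiter",   Some("scala")),
  ("null_input",     None)
).toDF("label", "skills")

val checked = samples.withColumn("skill_array", split(col("skills"), ","))

checked.show(false)
// with_delimiter -> [scala, spark]
// no_delimiter   -> [scala]   (one element, the original string)
// null_input     -> null      (a null array, not an empty one)
```

Because a null input yields a null array, downstream code that indexes into the result may want to handle that case explicitly.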

split with a limit

def split(str: Column, pattern: String, limit: Int): Column

The three-argument form was added in Spark 3.0.0. Its limit parameter controls how many times the pattern is applied:

  • limit > 0: The array has at most limit elements. The last element contains the remainder of the string — everything that would have been further split if the limit hadn't been reached.
  • limit <= 0: The pattern is applied as many times as possible (same as the two-argument form).
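To compare the two cases side by side, here is a minimal sketch that applies several limits to one hypothetical string. It assumes Spark 3.0.0 or later and a local SparkSession; the column names are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Assumption: a local session for experimentation; in spark-shell this reuses the existing one
val spark = SparkSession.builder().master("local[*]").appName("split-limits").getOrCreate()
import spark.implicits._

val limits = Seq("a,b,c,d").toDF("value")
  .withColumn("limit_2",   split(col("value"), ",", 2))   // at most 2 elements
  .withColumn("limit_3",   split(col("value"), ",", 3))   // at most 3 elements
  .withColumn("limit_neg", split(col("value"), ",", -1))  // no cap on the number of elements
  .withColumn("no_limit",  split(col("value"), ","))      // two-argument form

limits.show(false)
// limit_2:   [a, b,c,d]     two elements: "a" and the remainder "b,c,d"
// limit_3:   [a, b, c,d]    three elements: "a", "b", and the remainder "c,d"
// limit_neg: [a, b, c, d]
// no_limit:  [a, b, c, d]
```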

With limit=2, only the first comma is used as a split point:

val df = Seq(
  ("Alice Chen",   "scala,spark,functional-programming"),
  ("Bob Martinez", "java,spring,microservices"),
  ("Diana Okafor", "python,pandas,machine-learning"),
  ("Evan Patel",   "go,kubernetes,devops"),
).toDF("name", "skills")

val df2 = df
  .withColumn("top_two", split(col("skills"), ",", 2))

df2.show(false)
// +------------+----------------------------------+-------------------------------------+
// |name        |skills                            |top_two                              |
// +------------+----------------------------------+-------------------------------------+
// |Alice Chen  |scala,spark,functional-programming|[scala, spark,functional-programming]|
// |Bob Martinez|java,spring,microservices         |[java, spring,microservices]         |
// |Diana Okafor|python,pandas,machine-learning    |[python, pandas,machine-learning]    |
// |Evan Patel  |go,kubernetes,devops              |[go, kubernetes,devops]              |
// +------------+----------------------------------+-------------------------------------+

The first element is the part before the first comma (scala). The second element is everything that follows (spark,functional-programming) — the unsplit remainder.

The limit is especially useful for structured text where you only care about a fixed number of leading fields, like parsing log lines. Using \\s+ (one or more whitespace characters) as the pattern and limit=3 splits each line into timestamp, level, and the full message body:

val df = Seq(
  "2024-03-15T09:42:00Z INFO  User alice logged in",
  "2024-03-15T09:43:12Z WARN  Disk usage at 85%",
  "2024-03-15T09:44:55Z ERROR Database connection lost",
  "2024-03-15T09:45:01Z INFO  Reconnect attempt 1",
).toDF("log_line")

val df2 = df
  // "\\s+" is the Scala string for the regex \s+ (a single backslash would be an invalid escape)
  .withColumn("parts",     split(col("log_line"), "\\s+", 3))
  .withColumn("timestamp", split(col("log_line"), "\\s+", 3).getItem(0))
  .withColumn("level",     split(col("log_line"), "\\s+", 3).getItem(1))
  .withColumn("message",   split(col("log_line"), "\\s+", 3).getItem(2))

df2.show(false)
// +---------------------------------------------------+-------------------------------------------------------+--------------------+-----+------------------------+
// |log_line                                           |parts                                                  |timestamp           |level|message                 |
// +---------------------------------------------------+-------------------------------------------------------+--------------------+-----+------------------------+
// |2024-03-15T09:42:00Z INFO  User alice logged in    |[2024-03-15T09:42:00Z, INFO, User alice logged in]     |2024-03-15T09:42:00Z|INFO |User alice logged in    |
// |2024-03-15T09:43:12Z WARN  Disk usage at 85%       |[2024-03-15T09:43:12Z, WARN, Disk usage at 85%]        |2024-03-15T09:43:12Z|WARN |Disk usage at 85%       |
// |2024-03-15T09:44:55Z ERROR Database connection lost|[2024-03-15T09:44:55Z, ERROR, Database connection lost]|2024-03-15T09:44:55Z|ERROR|Database connection lost|
// |2024-03-15T09:45:01Z INFO  Reconnect attempt 1     |[2024-03-15T09:45:01Z, INFO, Reconnect attempt 1]      |2024-03-15T09:45:01Z|INFO |Reconnect attempt 1     |
// +---------------------------------------------------+-------------------------------------------------------+--------------------+-----+------------------------+

\\s+ handles the double space between INFO and the message — it matches one or more whitespace characters as a single separator. With limit=3, the third element captures everything from the third token onward as one string, so the full message is preserved.
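The log example writes out the split call once per derived column. One possible variation, sketched here under the assumption of a local SparkSession and a single sample line, computes the array once and reads each field from the parts column.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, split}

// Assumption: a local session for experimentation; in spark-shell this reuses the existing one
val spark = SparkSession.builder().master("local[*]").appName("split-log-fields").getOrCreate()
import spark.implicits._

val logs = Seq(
  "2024-03-15T09:42:00Z INFO  User alice logged in"
).toDF("log_line")

val parsed = logs
  // Split once into at most three parts: timestamp, level, rest of the line
  .withColumn("parts", split(col("log_line"), "\\s+", 3))
  // Read each field from the array column instead of repeating the split
  .select(
    col("log_line"),
    col("parts").getItem(0).as("timestamp"),
    col("parts").getItem(1).as("level"),
    col("parts").getItem(2).as("message")
  )

parsed.show(false)
// timestamp = 2024-03-15T09:42:00Z, level = INFO, message = User alice logged in
```

The result matches the repeated-split version without the parts column; the difference is mainly that the pattern and limit are stated in one place.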

For delimiter-based extraction that doesn't return an array, see substring_index — it returns the part of a string before or after the Nth occurrence of a delimiter. For pattern-based string replacement, see regexp_replace.
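For comparison, a brief sketch of substring_index on the same kind of data, assuming a local SparkSession and a one-row sample. Unlike split, its delimiter is a plain string rather than a regular expression, and the result is a string rather than an array.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, substring_index}

// Assumption: a local session for experimentation; in spark-shell this reuses the existing one
val spark = SparkSession.builder().master("local[*]").appName("substring-index").getOrCreate()
import spark.implicits._

val skills = Seq("scala,spark,functional-programming").toDF("skills")

val extracted = skills
  // count = 1: everything before the 1st comma
  .withColumn("first_skill", substring_index(col("skills"), ",", 1))
  // count = 2: everything before the 2nd comma
  .withColumn("first_two", substring_index(col("skills"), ",", 2))
  // count = -1: everything after the last comma
  .withColumn("last_skill", substring_index(col("skills"), ",", -1))

extracted.show(false)
// first_skill = scala
// first_two   = scala,spark
// last_skill  = functional-programming
```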

Example Details

Created: 2026-03-17 10:09:57 PM

Last Updated: 2026-03-17 10:09:57 PM