Spark Scala sentences Function
sentences splits a string into an array of sentences, where each sentence is an array of words. It's useful for text analysis tasks like counting sentences, extracting individual words, or preparing text for downstream NLP processing.
sentences
def sentences(string: Column): Column
sentences is a SQL function that, depending on your Spark version, may not be exposed in org.apache.spark.sql.functions — calling it through expr works everywhere, so the examples below do that. It uses Java's BreakIterator to detect sentence boundaries (periods, question marks, exclamation points) and word boundaries within each sentence. The return type is array<array<string>>.
Here, a text column is split into its component sentences and words:
import org.apache.spark.sql.functions._ // expr, size
import spark.implicits._                // toDF (spark is the active SparkSession)

val df = Seq(
"The weather is nice today. Let's go for a walk.",
"Spark handles big data well. It scales across clusters. Performance is impressive.",
"Order shipped. Expected delivery on Friday.",
"Hello world!",
).toDF("text")
val df2 = df
.withColumn("words_by_sentence", expr("sentences(text)"))
df2.show(false)
// +----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+
// |text |words_by_sentence |
// +----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+
// |The weather is nice today. Let's go for a walk. |[[The, weather, is, nice, today], [Let's, go, for, a, walk]] |
// |Spark handles big data well. It scales across clusters. Performance is impressive.|[[Spark, handles, big, data, well], [It, scales, across, clusters], [Performance, is, impressive]]|
// |Order shipped. Expected delivery on Friday. |[[Order, shipped], [Expected, delivery, on, Friday]] |
// |Hello world! |[[Hello, world]] |
// +----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+
Each row produces a nested array — the outer array holds sentences, and each inner array holds the words within that sentence. Punctuation is stripped from the output. A single sentence like "Hello world!" produces one inner array with two words.
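Since the result is nested, two common follow-ups are flattening it into one array of words per row, or exploding it into one row per sentence. A sketch, assuming the df2 from the example above (flatten requires Spark 2.4+):

```scala
import org.apache.spark.sql.functions._

// One flat array of words per row, plus a total word count.
val withWords = df2
  .withColumn("all_words", flatten(col("words_by_sentence")))
  .withColumn("word_count", size(col("all_words")))

// Alternatively, one row per sentence; each row keeps that sentence's words.
val perSentence = df2
  .select(col("text"), explode(col("words_by_sentence")).as("sentence_words"))
```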
sentences with locale parameters
def sentences(string: Column, language: Column, country: Column): Column
The three-argument form accepts language and country columns to control the locale used for sentence and word detection. This is relevant when text uses locale-specific punctuation or abbreviation rules:
val df = Seq(
"The weather is nice today. Let's go for a walk.",
"Spark handles big data well. It scales across clusters. Performance is impressive.",
"Order shipped. Expected delivery on Friday.",
"Hello world!",
).toDF("text")
val df2 = df
.withColumn("words_by_sentence", expr("sentences(text, 'en', 'US')"))
df2.show(false)
// +----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+
// |text |words_by_sentence |
// +----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+
// |The weather is nice today. Let's go for a walk. |[[The, weather, is, nice, today], [Let's, go, for, a, walk]] |
// |Spark handles big data well. It scales across clusters. Performance is impressive.|[[Spark, handles, big, data, well], [It, scales, across, clusters], [Performance, is, impressive]]|
// |Order shipped. Expected delivery on Friday. |[[Order, shipped], [Expected, delivery, on, Friday]] |
// |Hello world! |[[Hello, world]] |
// +----------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------+
For English text, the results are typically the same with or without the locale parameters. The locale matters more for languages with different sentence-breaking rules.
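Because the three-argument form takes Columns, the locale can also vary per row rather than being a literal. A sketch assuming hypothetical lang and country columns in the data:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Each row carries its own locale for sentence and word detection.
val localized = Seq(
  ("The weather is nice today. Let's go for a walk.", "en", "US"),
  ("Das Wetter ist heute schön. Gehen wir spazieren.", "de", "DE")
).toDF("text", "lang", "country")

val parsed = localized
  .withColumn("words_by_sentence", expr("sentences(text, lang, country)"))
```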
Counting sentences and null handling
Since sentences returns a nested array, you can use size to count the number of sentences in each row. When sentences encounters a null value, it returns null:
val df = Seq(
("review_1", "Great product! Works as expected. Would buy again."),
("review_2", "Terrible quality."),
("review_3", "Fast shipping. Good packaging. Item was exactly as described."),
("review_4", null: String),
).toDF("id", "review")
val df2 = df
.withColumn("review_sentences", expr("sentences(review)"))
.withColumn("sentence_count", size(col("review_sentences")))
df2.show(false)
// +--------+-------------------------------------------------------------+--------------------------------------------------------------------------+--------------+
// |id |review |review_sentences |sentence_count|
// +--------+-------------------------------------------------------------+--------------------------------------------------------------------------+--------------+
// |review_1|Great product! Works as expected. Would buy again. |[[Great, product], [Works, as, expected], [Would, buy, again]] |3 |
// |review_2|Terrible quality. |[[Terrible, quality]] |1 |
// |review_3|Fast shipping. Good packaging. Item was exactly as described.|[[Fast, shipping], [Good, packaging], [Item, was, exactly, as, described]]|3 |
// |review_4|null |null |-1 |
// +--------+-------------------------------------------------------------+--------------------------------------------------------------------------+--------------+
Note that size returns -1 for a null array rather than null — Spark's default legacy behavior, governed by spark.sql.legacy.sizeOfNull (under ANSI mode, size of null returns null instead). If you need a true null or zero for missing reviews, wrap the count in a when/otherwise expression or coalesce.
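A sketch of both options, assuming the df2 from the example above:

```scala
import org.apache.spark.sql.functions._

val counted = df2
  // null review -> null count instead of -1
  .withColumn("count_or_null",
    when(col("review").isNull, lit(null)).otherwise(size(col("review_sentences"))))
  // null review -> 0 instead of -1
  .withColumn("count_or_zero",
    when(col("review").isNull, lit(0)).otherwise(size(col("review_sentences"))))
```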
Related functions
For splitting a string on a specific delimiter into a flat array, see split — it gives you more control over the split pattern but doesn't detect sentence boundaries. For case normalization before text analysis, see lower and upper and initcap.
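For contrast, a rough sketch of a regex-based split on sentence-ending punctuation — it returns flat sentence strings rather than tokenized words, and it can break incorrectly on abbreviations like "e.g.":

```scala
import org.apache.spark.sql.functions._

// Split on ., !, or ? followed by whitespace; each element is a whole
// sentence string, with no per-word tokenization.
val naive = df.withColumn("rough_sentences", split(col("text"), "[.!?]\\s+"))
```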