Job Board
Consulting

Spark Scala Format String

The format_string function formats column values into a string using printf-style patterns. It's useful for building human-readable labels, combining columns into structured text, or formatting numbers inline without changing their type first.

The format_string function is defined as:

def format_string(format: String, arguments: Column*): Column

The format parameter is a Java printf-style format string. Common placeholders include %s for strings, %d for integers, and %f for floating-point numbers. The arguments are the columns whose values get substituted into the format string.

Building Formatted Strings from Columns

val df = Seq(
  ("Alice", "Engineering", 3),
  ("Bob", "Marketing", 7),
  ("Carol", "Engineering", 1),
  ("David", "Sales", 5),
).toDF("name", "department", "years")

val df2 = df
  .withColumn("summary", format_string("%s has been in %s for %d years", col("name"), col("department"), col("years")))

df2.show(false)
// +-----+-----------+-----+-----------------------------------------+
// |name |department |years|summary                                  |
// +-----+-----------+-----+-----------------------------------------+
// |Alice|Engineering|3    |Alice has been in Engineering for 3 years|
// |Bob  |Marketing  |7    |Bob has been in Marketing for 7 years    |
// |Carol|Engineering|1    |Carol has been in Engineering for 1 years|
// |David|Sales      |5    |David has been in Sales for 5 years      |
// +-----+-----------+-----+-----------------------------------------+

This is similar to what you'd do with concat and concat_ws, but format_string keeps the template in one place rather than interleaving lit() calls between columns.

Formatting Numbers with Precision and Alignment

Printf patterns give you control over decimal places and padding. Use %.2f to lock a number to two decimal places, or %-15s to left-align a string in a 15-character field:

val df = Seq(
  ("Laptop", 999.9),
  ("Headphones", 49.5),
  ("Monitor", 349.0),
  ("Keyboard", 74.99),
).toDF("product", "price")

val df2 = df
  .withColumn("price_label", format_string("$%.2f", col("price")))
  .withColumn("padded", format_string("%-15s %8.2f", col("product"), col("price")))

df2.show(false)
// +----------+-----+-----------+------------------------+
// |product   |price|price_label|padded                  |
// +----------+-----+-----------+------------------------+
// |Laptop    |999.9|$999.90    |Laptop            999.90|
// |Headphones|49.5 |$49.50     |Headphones         49.50|
// |Monitor   |349.0|$349.00    |Monitor           349.00|
// |Keyboard  |74.99|$74.99     |Keyboard           74.99|
// +----------+-----+-----------+------------------------+

If you only need comma-separated thousands formatting (like 1,234.56), the dedicated format_number function is a simpler choice. format_string is more flexible when you need to combine text and numbers in a single template.

Handling Nulls

When any of the argument columns contain null, format_string substitutes the literal text null into the output — it does not return null for the entire result:

val df = Seq(
  ("Alice", 85.6, null.asInstanceOf[String]),
  ("Bob", 92.3, "A"),
  (null, 78.1, "B"),
  ("David", 88.0, "B+"),
).toDF("name", "score", "grade")

val df2 = df
  .withColumn("report", format_string("Student: %s | Score: %.1f | Grade: %s", col("name"), col("score"), col("grade")))

df2.show(false)
// +-----+-----+-----+------------------------------------------+
// |name |score|grade|report                                    |
// +-----+-----+-----+------------------------------------------+
// |Alice|85.6 |null |Student: Alice | Score: 85.6 | Grade: null|
// |Bob  |92.3 |A    |Student: Bob | Score: 92.3 | Grade: A     |
// |null |78.1 |B    |Student: null | Score: 78.1 | Grade: B    |
// |David|88.0 |B+   |Student: David | Score: 88.0 | Grade: B+  |
// +-----+-----+-----+------------------------------------------+

This differs from concat, which returns null if any input column is null. If you need to suppress the word "null" in output, use coalesce to replace nulls before passing columns into format_string.

Example Details

Created: 2026-04-11 10:43:59 PM

Last Updated: 2026-04-11 10:43:59 PM