Spark Scala Format String
The format_string function formats column values into a string using printf-style patterns. It's useful for building human-readable labels, combining columns into structured text, or formatting numbers inline without casting them to strings first.
The format_string function is defined as:
def format_string(format: String, arguments: Column*): Column
The format parameter is a Java printf-style format string. Common placeholders include %s for strings, %d for integers, and %f for floating-point numbers. The arguments are the columns whose values are substituted into those placeholders, in the order given.
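Because these are ordinary Java printf patterns, a template can be tried out in plain Scala before it goes into a Spark job. The sketch below is only an illustration: the sample values are made up, and it uses Scala's formatLocal on the JVM rather than format_string itself, with Locale.US passed explicitly so the decimal separator does not depend on the machine's default locale.

```scala
import java.util.Locale

// Plain-JVM illustration of the three placeholders; no Spark involved.
// The sample values are hypothetical.
val s = "%s".formatLocal(Locale.US, "Alice") // string placeholder
val d = "%d".formatLocal(Locale.US, 42)      // integer placeholder
val f = "%f".formatLocal(Locale.US, 3.5)     // floating-point placeholder, six decimals by default

println(s) // Alice
println(d) // 42
println(f) // 3.500000
```

Note that a bare %f prints six decimal places, which is why a precision such as %.2f is usually added.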
Building Formatted Strings from Columns
import org.apache.spark.sql.functions.{col, format_string}
// Assumes a spark-shell session, where spark.implicits._ (needed for toDF) is already in scope.
val df = Seq(
  ("Alice", "Engineering", 3),
  ("Bob", "Marketing", 7),
  ("Carol", "Engineering", 1),
  ("David", "Sales", 5),
).toDF("name", "department", "years")
// Substitute the three columns, in order, into the template.
val df2 = df
  .withColumn("summary", format_string("%s has been in %s for %d years", col("name"), col("department"), col("years")))
df2.show(false)
// +-----+-----------+-----+-----------------------------------------+
// |name |department |years|summary                                  |
// +-----+-----------+-----+-----------------------------------------+
// |Alice|Engineering|3    |Alice has been in Engineering for 3 years|
// |Bob  |Marketing  |7    |Bob has been in Marketing for 7 years    |
// |Carol|Engineering|1    |Carol has been in Engineering for 1 years|
// |David|Sales      |5    |David has been in Sales for 5 years      |
// +-----+-----------+-----+-----------------------------------------+
This is similar to what you'd do with concat and concat_ws, but format_string keeps the template in one place rather than interleaving lit() calls between columns.
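To make the comparison concrete, here is a minimal sketch that builds the same sentence both ways. The local SparkSession, the single-row sample, and the column names with_format and with_concat are assumptions made for this illustration. The explicit cast on years is there so the example does not rely on implicit type coercion.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, format_string, lit}

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("format-string-vs-concat").getOrCreate()
import spark.implicits._

// Hypothetical single-row sample mirroring the columns used above.
val people = Seq(("Alice", "Engineering", 3)).toDF("name", "department", "years")

// with_format keeps the whole sentence in one template;
// with_concat splits the same sentence into lit() fragments between columns.
val compared = people
  .withColumn("with_format",
    format_string("%s has been in %s for %d years", col("name"), col("department"), col("years")))
  .withColumn("with_concat",
    concat(col("name"), lit(" has been in "), col("department"),
      lit(" for "), col("years").cast("string"), lit(" years")))

compared.select("with_format", "with_concat").show(false)
```

Both columns should hold the same text; the difference is only in how easy the template is to read and change.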
Formatting Numbers with Precision and Alignment
Printf patterns give you control over decimal places and padding. Use %.2f to round a number to two decimal places, %-15s to left-align a string in a 15-character field, or %8.2f to right-align a number in an 8-character field:
val df = Seq(
  ("Laptop", 999.9),
  ("Headphones", 49.5),
  ("Monitor", 349.0),
  ("Keyboard", 74.99),
).toDF("product", "price")
val df2 = df
  .withColumn("price_label", format_string("$%.2f", col("price")))
  .withColumn("padded", format_string("%-15s %8.2f", col("product"), col("price")))
df2.show(false)
// +----------+-----+-----------+------------------------+
// |product   |price|price_label|padded                  |
// +----------+-----+-----------+------------------------+
// |Laptop    |999.9|$999.90    |Laptop            999.90|
// |Headphones|49.5 |$49.50     |Headphones         49.50|
// |Monitor   |349.0|$349.00    |Monitor           349.00|
// |Keyboard  |74.99|$74.99     |Keyboard           74.99|
// +----------+-----+-----------+------------------------+
If you only need comma-separated thousands formatting (like 1,234.56), the dedicated format_number function is a simpler choice. format_string is more flexible when you need to combine text and numbers in a single template.
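As a rough side-by-side, the sketch below applies format_number and format_string to the same value. The local SparkSession, the sample order, and the column names grouped and label are assumptions made for this illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, format_number, format_string}

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("format-number-vs-string").getOrCreate()
import spark.implicits._

// Hypothetical sample row.
val orders = Seq(("A-100", 1234.5)).toDF("order_id", "total")

val labeled = orders
  // format_number: thousands separators plus a fixed number of decimals, nothing else.
  .withColumn("grouped", format_number(col("total"), 2))
  // format_string: free-form template mixing text, a string column, and a number.
  .withColumn("label", format_string("Order %s: $%.2f", col("order_id"), col("total")))

labeled.show(false)
// grouped: 1,234.50
// label:   Order A-100: $1234.50
```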
Handling Nulls
When any of the argument columns contains null, format_string substitutes the literal text null into the output rather than returning null for the entire result:
val df = Seq(
  ("Alice", 85.6, null.asInstanceOf[String]),
  ("Bob", 92.3, "A"),
  (null, 78.1, "B"),
  ("David", 88.0, "B+"),
).toDF("name", "score", "grade")
val df2 = df
  .withColumn("report", format_string("Student: %s | Score: %.1f | Grade: %s", col("name"), col("score"), col("grade")))
df2.show(false)
// +-----+-----+-----+------------------------------------------+
// |name |score|grade|report                                    |
// +-----+-----+-----+------------------------------------------+
// |Alice|85.6 |null |Student: Alice | Score: 85.6 | Grade: null|
// |Bob  |92.3 |A    |Student: Bob | Score: 92.3 | Grade: A     |
// |null |78.1 |B    |Student: null | Score: 78.1 | Grade: B    |
// |David|88.0 |B+   |Student: David | Score: 88.0 | Grade: B+  |
// +-----+-----+-----+------------------------------------------+
This differs from concat, which returns null if any input column is null. If you need to suppress the word "null" in output, use coalesce to replace nulls before passing columns into format_string.
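A minimal sketch of the coalesce approach follows. The local SparkSession, the two-row sample, and the "N/A" placeholder are assumptions made for this illustration; pick whatever fallback text suits your output.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, format_string, lit}

// Assumption: a throwaway local session; in spark-shell, `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("format-string-nulls").getOrCreate()
import spark.implicits._

// Hypothetical sample: Alice has no grade, so her grade column is null.
val students = Seq(
  ("Alice", Option.empty[String]),
  ("Bob", Option("A"))
).toDF("name", "grade")

// Replace a null grade with "N/A" before it reaches the template.
val reports = students.withColumn(
  "report",
  format_string("Student: %s | Grade: %s", col("name"), coalesce(col("grade"), lit("N/A")))
)

reports.show(false)
// report for Alice: Student: Alice | Grade: N/A
// report for Bob:   Student: Bob | Grade: A
```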