Concatenate Columns Together in Spark Scala
The concat function in Spark Scala takes multiple columns as input and returns a concatenated version of all of them. If any of the columns being concatenated is null, the result is null.
The concat function first appeared in Spark 1.5.0, and as of Spark 3.4.1 it is defined as:
def concat(exprs: Column*): Column
There are multiple ways to concatenate strings (and other column types) together in Spark Scala, so let's check out a quick example:
val df = Seq(
  ("John", "Enid"),
  ("Tiffany", "Yurt"),
  ("Ahmed", "Khara"),
).toDF("fname", "lname")
val df2 = df
  .withColumn("full_name", concat(col("fname"), lit(" "), col("lname")))
df2.show()
// +-------+-----+------------+
// | fname|lname| full_name|
// +-------+-----+------------+
// | John| Enid| John Enid|
// |Tiffany| Yurt|Tiffany Yurt|
// | Ahmed|Khara| Ahmed Khara|
// +-------+-----+------------+
When working with complex data systems, we often have to decide what to do when data is missing or null. When concat encounters a null, the entire result is null:
val df = Seq(
  ("John", "Enid"),
  ("Tiffany", "Yurt"),
  ("Ahmed", "Khara"),
  ("Richard", null),
).toDF("fname", "lname")
val df2 = df
  .withColumn("full_name", concat(col("fname"), lit(" "), col("lname")))
df2.show()
// +-------+-----+------------+
// | fname|lname| full_name|
// +-------+-----+------------+
// | John| Enid| John Enid|
// |Tiffany| Yurt|Tiffany Yurt|
// | Ahmed|Khara| Ahmed Khara|
// |Richard| null| null|
// +-------+-----+------------+
One way around this is to replace the nulls with a when/otherwise expression:
val df = Seq(
  ("John", "Enid"),
  ("Tiffany", "Yurt"),
  ("Ahmed", "Khara"),
  ("Richard", null),
).toDF("fname", "lname")
val df2 = df
  .withColumn("tmp_lname", when(col("lname").isNull, lit("")).otherwise(col("lname")))
  .withColumn("full_name", concat(col("fname"), lit(" "), col("tmp_lname")))
  .drop("tmp_lname")
df2.show()
// +-------+-----+------------+
// | fname|lname| full_name|
// +-------+-----+------------+
// | John| Enid| John Enid|
// |Tiffany| Yurt|Tiffany Yurt|
// | Ahmed|Khara| Ahmed Khara|
// |Richard| null| Richard |
// +-------+-----+------------+
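If an empty-string default is all you need, the built-in coalesce function (which returns its first non-null argument) can express the same fix without the temporary column. A minimal sketch of the same example:

```scala
import org.apache.spark.sql.functions.{coalesce, col, concat, lit}

val df2 = df
  // coalesce returns its first non-null argument, so a null lname becomes ""
  .withColumn("full_name", concat(col("fname"), lit(" "), coalesce(col("lname"), lit(""))))
```

This produces the same "Richard " result (including the trailing space) as the when/otherwise version above.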
You also have the option to use concat_ws (concat with separator) to manage nulls more easily. It is also a cleaner interface when you have many columns to join with the same separator. Let's check out some examples for comparison:
val df = Seq(
  ("John", "H", "Enid"),
  ("Tiffany", "Ruby", "Yurt"),
  ("Ahmed", null, "Khara"),
  ("Richard", null, null),
).toDF("fname", "mname", "lname")
val df2 = df
  .withColumn("full_name_1", concat(col("fname"), lit(" "), col("mname"), lit(" "), col("lname")))
  .withColumn("full_name_2", concat_ws(" ", col("fname"), col("mname"), col("lname")))
df2.show()
// +-------+-----+-----+-----------------+-----------------+
// | fname|mname|lname| full_name_1| full_name_2|
// +-------+-----+-----+-----------------+-----------------+
// | John| H| Enid| John H Enid| John H Enid|
// |Tiffany| Ruby| Yurt|Tiffany Ruby Yurt|Tiffany Ruby Yurt|
// | Ahmed| null|Khara| null| Ahmed Khara|
// |Richard| null| null| null| Richard|
// +-------+-----+-----+-----------------+-----------------+
In the results above you can see that concat_ws is easier to work with when your data may contain nulls: it skips null values instead of returning null, and it only places the separator between the values it keeps.
You can also use concat and concat_ws to combine arrays, though the effects are a bit different: concat_ws converts the arrays to strings and joins their elements with the separator, while concat combines the arrays themselves into a single array.
val df = Seq(
  (Some(Array("AAA", "BBB")), Some(Array("BBB", "CCC"))),
  (None, Some(Array("BBB", "CCC"))),
  (Some(Array("AAA", "BBB")), None),
).toDF("arr1", "arr2")
val df2 = df
  .withColumn("r_1", concat(col("arr1"), col("arr2")))
  .withColumn("r_2", concat_ws("|", col("arr1"), col("arr2")))
df2.printSchema
// root
// |-- arr1: array (nullable = true)
// | |-- element: string (containsNull = true)
// |-- arr2: array (nullable = true)
// | |-- element: string (containsNull = true)
// |-- r_1: array (nullable = true)
// | |-- element: string (containsNull = true)
// |-- r_2: string (nullable = false)
df2.show()
// +----------+----------+--------------------+---------------+
// | arr1| arr2| r_1| r_2|
// +----------+----------+--------------------+---------------+
// |[AAA, BBB]|[BBB, CCC]|[AAA, BBB, BBB, CCC]|AAA|BBB|BBB|CCC|
// | null|[BBB, CCC]| null| BBB|CCC|
// |[AAA, BBB]| null| null| AAA|BBB|
// +----------+----------+--------------------+---------------+
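If you want the array version of concat to tolerate a null column rather than return null, the same coalesce idea applies: substitute an empty array for the null. A sketch, assuming a typed empty-array literal works for your element type:

```scala
import org.apache.spark.sql.functions.{coalesce, col, concat, lit}

val df3 = df
  // replace a null array column with an empty array so concat no longer returns null
  .withColumn("r_1_safe", concat(
    coalesce(col("arr1"), lit(Array.empty[String])),
    coalesce(col("arr2"), lit(Array.empty[String]))
  ))
```

With this change, the rows where arr1 or arr2 is null would yield the surviving array's elements instead of null.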