Job Board
Consulting

Coalesce Multiple Columns in Spark Scala

The coalesce function returns the first non-null value from a list of columns. It's a common technique when you have multiple values and you want to prioritize selecting the first available one from them.

This is commonly used when you're dealing with data where missing values need to be filled in with the first available non-null value.

The coalesce function first appeared in version 1.3.0 and as of Spark 3.4.1 it is defined as:

def coalesce(e: Column*): Column

A Simple Example

Let's look at an example. Consider a scenario where you have a bunch of fields for different contact phone numbers for people. One for the home, office and cell. Let's say you want to select just one number based upon a priority. Now a days that would be cell phone first!

val df = Seq(
  ("Bob", "555-555-1111", "555-555-1112", "555-555-1113"),
  ("Joe", null, null, "555-555-2223"),
  ("Melissa", "555-555-3333", "555-555-3332", null),
  ("Xing", null, "555-555-4442", null),
).toDF("name", "home", "office", "cell")

val df2 = df
  .withColumn("phone", coalesce(col("cell"), col("home"), col("office")))

df2.show()

// +-------+------------+------------+------------+------------+
// |   name|        home|      office|        cell|       phone|
// +-------+------------+------------+------------+------------+
// |    Bob|555-555-1111|555-555-1112|555-555-1113|555-555-1113|
// |    Joe|        null|        null|555-555-2223|555-555-2223|
// |Melissa|555-555-3333|555-555-3332|        null|555-555-3333|
// |   Xing|        null|555-555-4442|        null|555-555-4442|
// +-------+------------+------------+------------+------------+

As you can see from the example above the cell number is chosen first, followed by the home and then finally the office. If there are null's in every column in the parameter list null is returned.

val df = Seq(
  ("Paul", null, null, null),
).toDF("name", "home", "office", "cell")

val df2 = df
  .withColumn("phone", coalesce(col("cell"), col("home"), col("office")))

df2.show()

// +----+----+------+----+-----+
// |name|home|office|cell|phone|
// +----+----+------+----+-----+
// |Paul|null|  null|null| null|
// +----+----+------+----+-----+

Another useful technique is when you are selecting from a series of complex statements. Since columns are just expressions these can be really flexible in helping to create a clear and easy to undertand data cleaning pipeline. Let's udpate the example above to add a preffered contact method that is taken into consideration within the coalesce logic.

val df = Seq(
  ("Bob", "home", "555-555-1111", "555-555-1112", "555-555-1113"),
  ("Joe", null, null, null, "555-555-2223"),
  ("Melissa", "office", "555-555-3333", "555-555-3332", null),
  ("Xing", null, null, "555-555-4442", null),
).toDF("name", "preferred", "home", "office", "cell")

val df2 = df
  .withColumn("phone", coalesce(
    when(col("preferred") === "home", col("home")),
    when(col("preferred") === "office", col("office")),
    col("cell"),
    col("home"),
    col("office")
  ))

df2.show()

// +-------+---------+------------+------------+------------+------------+
// |   name|preferred|        home|      office|        cell|       phone|
// +-------+---------+------------+------------+------------+------------+
// |    Bob|     home|555-555-1111|555-555-1112|555-555-1113|555-555-1111|
// |    Joe|     null|        null|        null|555-555-2223|555-555-2223|
// |Melissa|   office|555-555-3333|555-555-3332|        null|555-555-3332|
// |   Xing|     null|        null|555-555-4442|        null|555-555-4442|
// +-------+---------+------------+------------+------------+------------+

Sometimes it is really usefull to have a fallback 'default' value for a column. Coalesce can be used in these situations to easily provide this 'default' value if one isn't available. Let's apply some data transformation logic to fall back to a default 'preferred' contact method if one isn't available.

val df = Seq(
  ("Bob", "home"),
  ("Joe", null),
  ("Melissa", "office"),
  ("Xing", null),
).toDF("name", "preferred")

val df2 = df
  .withColumn("preferred", coalesce(col("preferred"), lit("cell")))

df2.show()

// +-------+---------+
// |   name|preferred|
// +-------+---------+
// |    Bob|     home|
// |    Joe|     cell|
// |Melissa|   office|
// |   Xing|     cell|
// +-------+---------+

Coalesce is one of my favorite transforms to use. I find my self often relying on it when doing basic data cleaning tasks, I hope it helps you too!

Example Details

Created: 2023-08-14 09:06:00 PM

Last Updated: 2023-08-14 09:06:00 PM