Spark Scala Trim Functions: trim, ltrim and rtrim
When doing string manipulation in Spark Scala DataFrames, trim is a frequently used function that quickly cleans up whitespace (or any other characters) from the start and end of a string column.
Let's see a quick example in action:
import org.apache.spark.sql.functions.{col, ltrim, rtrim, trim}
import spark.implicits._ // already in scope in spark-shell

val df = Seq(
  " Hello, world!",
  " Hello, world! ",
  "Hello, world! "
).toDF("example")

val df2 = df
  .withColumn("result", trim(col("example")))

df2.show(false)
// +---------------+-------------+
// |example        |result       |
// +---------------+-------------+
// | Hello, world! |Hello, world!|
// | Hello, world! |Hello, world!|
// |Hello, world!  |Hello, world!|
// +---------------+-------------+
By default, calling the trim function on a string column removes only spaces. There is also an overloaded version of the function that takes a string argument and removes any of the characters found in that string from both ends. It is defined as:
def trim(e: Column, trimString: String): Column
You pass the characters you want to remove into the trimString parameter. So if you also want to remove other whitespace characters, it will look something like this:
val df = Seq(
  " Salutations, Earthlings!",
  " \n Greetings, Universe! \t ",
  "\tHello, world!\n",
  " \n\t Greetings, Universe!\t \n \t\t"
).toDF("example")

val df2 = df
  .withColumn("result_1", trim(col("example")))
  .withColumn("result_2", trim(col("example"), " \n\t"))

df2.show(false)
// +------------------------------------+-----------------------------------+------------------------+
// |example                             |result_1                           |result_2                |
// +------------------------------------+-----------------------------------+------------------------+
// | Salutations, Earthlings!           |Salutations, Earthlings!           |Salutations, Earthlings!|
// | \n Greetings, Universe! \t         |\n Greetings, Universe! \t         |Greetings, Universe!    |
// |\tHello, world!\n                   |\tHello, world!\n                  |Hello, world!           |
// | \n\t Greetings, Universe!\t \n \t\t|\n\t Greetings, Universe!\t \n \t\t|Greetings, Universe!    |
// +------------------------------------+-----------------------------------+------------------------+
Depending on the type of data cleansing you are doing, you may only need to trim the left or right side of a string. Spark also provides ltrim and rtrim functions that work the same way as trim, except that they only operate on their respective side:
val df = Seq(
  " Hello, world! ",
  " \t Hello, world! \t\n "
).toDF("example")

val df2 = df
  .withColumn("ltrim_1", ltrim(col("example")))
  .withColumn("ltrim_2", ltrim(col("example"), " \n\t"))
  .withColumn("rtrim_1", rtrim(col("example")))
  .withColumn("rtrim_2", rtrim(col("example"), " \n\t"))

df2.show(false)
// +-----------------------+----------------------+-------------------+----------------------+-----------------+
// |example                |ltrim_1               |ltrim_2            |rtrim_1               |rtrim_2          |
// +-----------------------+----------------------+-------------------+----------------------+-----------------+
// | Hello, world!         |Hello, world!         |Hello, world!      | Hello, world!        | Hello, world!   |
// | \t Hello, world! \t\n |\t Hello, world! \t\n |Hello, world! \t\n | \t Hello, world! \t\n| \t Hello, world!|
// +-----------------------+----------------------+-------------------+----------------------+-----------------+
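As a final note on data cleansing, you often want to apply this kind of trimming to every string column at once rather than column by column. Here is a minimal sketch of one way to do that, assuming the imports from the first example and a DataFrame named df; the trimAllStringColumns helper is just an illustration, not part of Spark's API.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.StringType

// Hypothetical helper: trims spaces, newlines and tabs from both ends of
// every StringType column, leaving all other columns untouched.
def trimAllStringColumns(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      case StringType => acc.withColumn(field.name, trim(col(field.name), " \n\t"))
      case _          => acc
    }
  }

val cleaned = trimAllStringColumns(df)
cleaned.show(false)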