Spark Scala Trim Functions: trim, ltrim and rtrim
When doing string manipulation in Spark Scala DataFrames, trim is a frequently used function that quickly cleans up whitespace (or any other characters) from the start and end of a string column.
Let's see a quick example in action:
import org.apache.spark.sql.functions.{col, ltrim, rtrim, trim}
import spark.implicits._ // already in scope in spark-shell

val df = Seq(
  " Hello, world!",
  " Hello, world! ",
  "Hello, world! "
).toDF("example")

val df2 = df
  .withColumn("result", trim(col("example")))

df2.show(false)
// +---------------+-------------+
// |example        |result       |
// +---------------+-------------+
// | Hello, world! |Hello, world!|
// | Hello, world! |Hello, world!|
// |Hello, world!  |Hello, world!|
// +---------------+-------------+
By default, calling the trim function on a string column removes only spaces. There is also an overloaded version of the function that takes a string argument and removes any of the characters found in that string from both ends. It is defined as:
def trim(e: Column, trimString: String): Column
You pass the characters you want to remove into the trimString parameter. So if you also want to remove other whitespace characters, it will look something like this:
val df = Seq(
  " Salutations, Earthlings!",
  " \n Greetings, Universe! \t ",
  "\tHello, world!\n",
  " \n\t Greetings, Universe!\t \n \t\t"
).toDF("example")

val df2 = df
  .withColumn("result_1", trim(col("example")))
  .withColumn("result_2", trim(col("example"), " \n\t"))

df2.show(false)
// +------------------------------------+-----------------------------------+------------------------+
// |example                             |result_1                           |result_2                |
// +------------------------------------+-----------------------------------+------------------------+
// | Salutations, Earthlings!           |Salutations, Earthlings!           |Salutations, Earthlings!|
// | \n Greetings, Universe! \t         |\n Greetings, Universe! \t         |Greetings, Universe!    |
// |\tHello, world!\n                   |\tHello, world!\n                  |Hello, world!           |
// | \n\t Greetings, Universe!\t \n \t\t|\n\t Greetings, Universe!\t \n \t\t|Greetings, Universe!    |
// +------------------------------------+-----------------------------------+------------------------+
Depending on the type of data cleansing you are doing, you may only need to trim the left or right side of a string. Spark also provides ltrim and rtrim functions that work the same way as trim, except that they only operate on their respective side:
val df = Seq(
  " Hello, world! ",
  " \t Hello, world! \t\n "
).toDF("example")

val df2 = df
  .withColumn("ltrim_1", ltrim(col("example")))
  .withColumn("ltrim_2", ltrim(col("example"), " \n\t"))
  .withColumn("rtrim_1", rtrim(col("example")))
  .withColumn("rtrim_2", rtrim(col("example"), " \n\t"))

df2.show(false)
// +-----------------------+----------------------+-------------------+----------------------+-----------------+
// |example                |ltrim_1               |ltrim_2            |rtrim_1               |rtrim_2          |
// +-----------------------+----------------------+-------------------+----------------------+-----------------+
// | Hello, world!         |Hello, world!         |Hello, world!      | Hello, world!        | Hello, world!   |
// | \t Hello, world! \t\n |\t Hello, world! \t\n |Hello, world! \t\n | \t Hello, world! \t\n| \t Hello, world!|
// +-----------------------+----------------------+-------------------+----------------------+-----------------+
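As a final note on data cleansing, you often want to apply this kind of trimming to every string column at once rather than column by column. Here is a minimal sketch of one way to do that, assuming the imports from the first example and a DataFrame named df; the trimAllStringColumns helper is just an illustration, not part of Spark's API.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, trim}
import org.apache.spark.sql.types.StringType

// Hypothetical helper: trims spaces, newlines and tabs from both ends of
// every StringType column, leaving all other columns untouched.
def trimAllStringColumns(df: DataFrame): DataFrame =
  df.schema.fields.foldLeft(df) { (acc, field) =>
    field.dataType match {
      case StringType => acc.withColumn(field.name, trim(col(field.name), " \n\t"))
      case _          => acc
    }
  }

val cleaned = trimAllStringColumns(df)
cleaned.show(false)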