Spark Scala IsIn Examples
The isin function is defined on a Spark Column and is used to filter rows in a DataFrame or Dataset. It compares the value of the column to the list of values provided in the isin call.
If you are familiar with SQL, the isin function is the equivalent of the IN operator.
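To make the SQL comparison concrete, here is a minimal sketch that runs the same filter once with the SQL IN operator and once with isin. The local SparkSession, the temp view name states, and the variable names are assumptions made for illustration; in spark-shell or a notebook a session named spark normally already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only (assumption); getOrCreate reuses an existing one.
val spark = SparkSession.builder().master("local[1]").appName("isin-sql").getOrCreate()
import spark.implicits._

val states = Seq("CA", "NY", "FL", "IL").toDF("state")
// Hypothetical view name, used only so the SQL query has something to select from.
states.createOrReplaceTempView("states")

// SQL version: the IN operator
val viaSql = spark.sql("SELECT state FROM states WHERE state IN ('NY', 'FL')")

// Column API version: isin
val viaIsin = states.filter(col("state").isin("NY", "FL"))

// Collect and sort both results so they can be compared directly.
val sqlRows = viaSql.as[String].collect().sorted
val isinRows = viaIsin.as[String].collect().sorted
```

Both sqlRows and isinRows should contain the same two states, which is the sense in which isin mirrors IN.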
Filtering DataFrames is a very common and useful technique in data engineering, and the isin function is likely to appear in most data pipelines.
IsIn Filtering Example
Let's assume you have a list of states and you only want ones that are located on the east coast:
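import org.apache.spark.sql.functions.col
import spark.implicits._ // enables toDF; assumes a SparkSession named spark

val df = Seq(
  "CA",
  "NY",
  "FL",
  "IL"
).toDF("state")

// Keep only the rows whose state is in the given list
val df2 = df.filter(col("state").isin("NY", "FL"))
df2.show()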
// +-----+
// |state|
// +-----+
// |   NY|
// |   FL|
// +-----+
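In practice the list of values often lives in a Scala collection rather than being typed inline. The sketch below shows one way to handle that, along with the negated (NOT IN) form. The eastCoast list and the local SparkSession are assumptions for illustration; isInCollection is available from Spark 2.4 onward.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only (assumption); getOrCreate reuses an existing one.
val spark = SparkSession.builder().master("local[1]").appName("isin-collection").getOrCreate()
import spark.implicits._

val df = Seq("CA", "NY", "FL", "IL").toDF("state")

// Hypothetical lookup list, e.g. loaded from configuration.
val eastCoast = Seq("NY", "FL")

// isin takes varargs, so a collection is expanded with `: _*`.
val east = df.filter(col("state").isin(eastCoast: _*))

// isInCollection (Spark 2.4+) accepts the collection directly.
val eastAlt = df.filter(col("state").isInCollection(eastCoast))

// Prefix the expression with `!` to keep the rows that are NOT in the list.
val notEast = df.filter(!col("state").isin(eastCoast: _*))

// Collect to plain Scala sets for easy inspection.
val eastStates = east.as[String].collect().toSet
val eastAltStates = eastAlt.as[String].collect().toSet
val notEastStates = notEast.as[String].collect().toSet
```

Passing the Seq without `: _*` would hand isin a single value (the whole sequence) rather than its elements, so the expansion matters.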
IsIn Enrichment Example
The isin function isn't limited to filtering. It operates on a column and returns a Boolean column (true or false for each row), so you can also use it to enrich a dataset based on the result. Let's update the example above so that, instead of filtering, we flag each row with a new column:
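import org.apache.spark.sql.functions.col
import spark.implicits._ // enables toDF; assumes a SparkSession named spark

val df = Seq(
  "CA",
  "NY",
  "FL",
  "IL"
).toDF("state")

// Add a Boolean column instead of dropping rows
val df2 = df.withColumn("east_coast", col("state").isin("NY", "FL"))
df2.show()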
// +-----+----------+
// |state|east_coast|
// +-----+----------+
// |   CA|     false|
// |   NY|      true|
// |   FL|      true|
// |   IL|     false|
// +-----+----------+
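Because isin yields a Boolean column, it can also serve as the condition in when/otherwise to produce a label instead of a true/false flag. This is a minimal sketch; the region column name, its label values, and the local SparkSession are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Local session for illustration only (assumption); getOrCreate reuses an existing one.
val spark = SparkSession.builder().master("local[1]").appName("isin-label").getOrCreate()
import spark.implicits._

val df = Seq("CA", "NY", "FL", "IL").toDF("state")

// Hypothetical "region" column: a text label driven by the isin condition.
val labelled = df.withColumn(
  "region",
  when(col("state").isin("NY", "FL"), "east_coast").otherwise("other")
)

// Collect as a state -> region map for easy inspection.
val regions = labelled.as[(String, String)].collect().toMap
```

Here regions maps NY and FL to east_coast and the remaining states to other.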
Overall
As you can see from the two examples above, the isin function is particularly useful for data cleansing, data subsetting, data enrichment, or any scenario where you need conditional logic based on a list of values.