Spark Scala IsIn Examples

The isin function is defined on a Spark Column and is used to filter rows in a DataFrame or Dataset.

The isin function compares the value of the column against the list of values passed to the isin call and evaluates to true when the value matches any of them.

If you are familiar with SQL, the isin function is the equivalent of the IN operator.
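
To make the comparison with SQL concrete, here is a minimal sketch that runs the same filter twice, once through the Column API and once as a SQL query. The view name `states` and the local session settings are assumptions chosen for illustration; in spark-shell, where `spark` and its implicits are already provided, the session lines can be skipped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only (assumed setup)
val spark = SparkSession.builder().master("local[*]").appName("isin-sql").getOrCreate()
import spark.implicits._

val states = Seq("CA", "NY", "FL", "IL").toDF("state")

// Register the DataFrame so it can be queried with SQL ("states" is a hypothetical view name)
states.createOrReplaceTempView("states")

// Column API: isin
val viaIsin = states.filter(col("state").isin("NY", "FL"))

// SQL: IN operator
val viaSql = spark.sql("SELECT state FROM states WHERE state IN ('NY', 'FL')")
```

Both DataFrames should contain the same two rows, NY and FL.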

Filtering DataFrames is a very common and useful technique in data engineering, and the isin function shows up in many data pipelines.

IsIn Filtering Example

Let's assume you have a list of states and only want the ones located on the East Coast:

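import org.apache.spark.sql.functions.col
import spark.implicits._ // enables toDF; assumes a SparkSession named spark

// Sample data: one row per state
val df = Seq(
  ("CA"),
  ("NY"),
  ("FL"),
  ("IL")
).toDF("state")

// Keep only the rows whose state appears in the list
val df2 = df.filter(col("state").isin("NY", "FL"))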

df2.show()

// +-----+
// |state|
// +-----+
// |   NY|
// |   FL|
// +-----+
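
In practice the list of values often lives in a Scala collection rather than being typed inline. Because isin takes a variable number of arguments, a collection has to be expanded with `: _*`. The sketch below shows that, along with negating the condition using `!`. The names `eastCoast`, `onEastCoast` and `notEastCoast` are hypothetical, and the session lines can be skipped in spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Local session for illustration only (assumed setup)
val spark = SparkSession.builder().master("local[*]").appName("isin-collection").getOrCreate()
import spark.implicits._

val states = Seq("CA", "NY", "FL", "IL").toDF("state")

// Hypothetical lookup list; it could just as well come from configuration
val eastCoast = Seq("NY", "FL")

// Expand the collection into varargs with `: _*`
val onEastCoast = states.filter(col("state").isin(eastCoast: _*))

// Negate the condition with `!` to keep the states that are NOT in the list
val notEastCoast = states.filter(!col("state").isin(eastCoast: _*))
```

With this data, onEastCoast should hold NY and FL, and notEastCoast should hold CA and IL. Spark 2.4 and later also provide isInCollection on Column, which accepts a collection directly.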

IsIn Enrichment Example

The isin function is not limited to filtering. It operates on a column and returns a boolean Column that is true or false for each row, so you can also use it to enrich a dataset based on the result. Let's update the example above so that, instead of filtering, we flag each row with a new column:

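// Same sample data as in the filtering example
val df = Seq(
  ("CA"),
  ("NY"),
  ("FL"),
  ("IL")
).toDF("state")

// isin yields a boolean column, stored here as east_coast
val df2 = df.withColumn("east_coast", col("state").isin("NY", "FL"))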

df2.show()

// +-----+----------+
// |state|east_coast|
// +-----+----------+
// |   CA|     false|
// |   NY|      true|
// |   FL|      true|
// |   IL|     false|
// +-----+----------+
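
If a readable label is more useful than a boolean flag, the isin condition can be combined with when and otherwise. This is only a sketch: the column name `region` and the label values are made up for illustration, and the session lines can be skipped in spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, when}

// Local session for illustration only (assumed setup)
val spark = SparkSession.builder().master("local[*]").appName("isin-label").getOrCreate()
import spark.implicits._

val states = Seq("CA", "NY", "FL", "IL").toDF("state")

// Map the boolean isin result to a label ("region" and its values are hypothetical)
val labelled = states.withColumn(
  "region",
  when(col("state").isin("NY", "FL"), "east_coast").otherwise("other")
)
```

Here NY and FL should be labelled east_coast, and CA and IL should be labelled other.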

Overall

As the two examples above show, the isin function is useful for data cleansing, data subsetting, data enrichment, or any scenario where you need conditional logic based on a list of values.