
Why null =!= null Returns null in Spark Scala, Not true

If you've ever tried to check for missing values using =!= or === in Spark and gotten surprising results, you've hit SQL three-valued logic. In Spark, null doesn't mean false — it means unknown — and that changes how comparisons behave in ways that can silently corrupt your data.

The Surprising Behavior

Here's the setup: you have a DataFrame with some rows that are missing an email address. You add a column to flag those missing rows.

import org.apache.spark.sql.functions.col
import spark.implicits._ // spark is the active SparkSession (provided in spark-shell)

val df = Seq(
  ("Alice", Some("alice@example.com")),
  ("Bob", None),
  ("Charlie", Some("charlie@example.com")),
  ("Diana", None),
).toDF("name", "email")

val result = df.withColumn("email_missing", col("email") =!= col("email"))

result.show(false)
// +-------+-------------------+-------------+
// |name   |email              |email_missing|
// +-------+-------------------+-------------+
// |Alice  |alice@example.com  |false        |
// |Bob    |null               |null         |
// |Charlie|charlie@example.com|false        |
// |Diana  |null               |null         |
// +-------+-------------------+-------------+

email_missing is null for Bob and Diana — not true. And notice: alice@example.com =!= alice@example.com is false, which makes sense. But the null rows return null, not true or false.

Why This Happens

In SQL (and Spark follows SQL semantics), null represents an unknown value. When you compare two unknowns, the result is also unknown — not true, not false.

Think of it this way: if you don't know what Bob's email address is, you can't answer whether it's equal to something or not. The answer isn't "yes it's different", it's "I don't know." That's what null means in a comparison result.

This is called three-valued logic: every comparison can return true, false, or null. The rules are:

  • null === null → null (not true)
  • null =!= null → null (not true)
  • null === "anything" → null
  • null =!= "anything" → null

The practical consequence is that null propagates through boolean expressions. A filter that includes a null comparison will drop those rows entirely — because Spark only keeps rows where the filter expression evaluates to true.
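The propagation is easy to see directly. Here is a minimal sketch, assuming an active SparkSession named spark as in the examples below:

```scala
import org.apache.spark.sql.functions.lit
import spark.implicits._

// A boolean column whose value is null ("unknown").
val unknown = lit(null).cast("boolean")

Seq(1).toDF("dummy").select(
  (unknown && lit(false)).as("unknown_and_false"), // false: a known false decides AND
  (unknown && lit(true)).as("unknown_and_true"),   // null:  the unknown operand decides AND
  (unknown || lit(true)).as("unknown_or_true"),    // true:  a known true decides OR
  (unknown || lit(false)).as("unknown_or_false")   // null:  the unknown operand decides OR
).show()
```

Note that null doesn't make every expression null: a known false short-circuits AND, and a known true short-circuits OR, regardless of the unknown operand.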

The Silent Drop Problem

This is where things get dangerous. You might filter expecting null rows to pass through:

val df = Seq(
  ("Alice", Some("alice@example.com")),
  ("Bob", None),
  ("Charlie", Some("charlie@example.com")),
  ("Diana", None),
  ("Eve", Some("eve@example.com")),
).toDF("name", "email")

// Trying to exclude Alice — expecting null rows to pass through
val result = df.filter(col("email") =!= lit("alice@example.com"))

result.show(false)
// +-------+-------------------+
// |name   |email              |
// +-------+-------------------+
// |Charlie|charlie@example.com|
// |Eve    |eve@example.com    |
// +-------+-------------------+

Bob and Diana are gone. The comparison null =!= "alice@example.com" evaluates to null, which Spark treats as "don't include this row." This is silent data loss: no error, no warning, just missing rows.
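If you do need an inequality filter that keeps the null rows, one common workaround is to spell out the null case with an OR. A sketch:

```scala
import org.apache.spark.sql.functions.{col, lit}

// Keep every row that is either provably not Alice's email, or has no email at all.
val kept = df.filter(
  col("email") =!= lit("alice@example.com") || col("email").isNull
)
// For Bob and Diana the left side is null, but isNull is true,
// and (null OR true) is true, so those rows pass the filter.
```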

The Fix: Use isNull and isNotNull

When you want to check whether a value is missing, use .isNull and .isNotNull. These are designed for null checks and always return true or false — never null.

val df = Seq(
  ("Alice", Some("alice@example.com")),
  ("Bob", None),
  ("Charlie", Some("charlie@example.com")),
  ("Diana", None),
  ("Eve", Some("eve@example.com")),
).toDF("name", "email")

val result = df.withColumn("has_email", col("email").isNotNull)

result.show(false)
// +-------+-------------------+---------+
// |name   |email              |has_email|
// +-------+-------------------+---------+
// |Alice  |alice@example.com  |true     |
// |Bob    |null               |false    |
// |Charlie|charlie@example.com|true     |
// |Diana  |null               |false    |
// |Eve    |eve@example.com    |true     |
// +-------+-------------------+---------+

isNotNull gives you the expected boolean result for every row. Similarly, .isNull returns true for null rows and false for non-null rows.
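Because these predicates always yield a definite boolean, they are also the right tool for selecting the missing rows themselves. For example, with the df above:

```scala
import org.apache.spark.sql.functions.col

// Rows where the email is missing: Bob and Diana.
val missing = df.filter(col("email").isNull)

// The complement: rows that do have an email.
val present = df.filter(col("email").isNotNull)
```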

Null-Safe Equality with <=>

If you want to filter out a specific value while keeping null rows, use the <=> operator (null-safe equality). Unlike ===, <=> treats null <=> null as true and never returns null.

val df = Seq(
  ("Alice", Some("alice@example.com")),
  ("Bob", None),
  ("Charlie", Some("charlie@example.com")),
  ("Diana", None),
  ("Eve", Some("eve@example.com")),
).toDF("name", "email")

// Exclude Alice, but preserve rows where email is null
val result = df.filter(!(col("email") <=> lit("alice@example.com")))

result.show(false)
// +-------+-------------------+
// |name   |email              |
// +-------+-------------------+
// |Bob    |null               |
// |Charlie|charlie@example.com|
// |Diana  |null               |
// |Eve    |eve@example.com    |
// +-------+-------------------+

!(col("email") <=> lit("alice@example.com")) reads as "not null-safely equal to alice@example.com". For null rows, null <=> "alice@example.com" is false, so the negation is true — the row passes through. Alice's row is the only one excluded.
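The <=> operator also has a named form, eqNullSafe, and the operator is available in Spark SQL as well. A sketch (the view name users is just an illustration):

```scala
import org.apache.spark.sql.functions.{col, lit}

// Operator and method forms are equivalent:
val viaOperator = df.filter(!(col("email") <=> lit("alice@example.com")))
val viaMethod   = df.filter(!col("email").eqNullSafe(lit("alice@example.com")))

// The same null-safe operator works in Spark SQL:
df.createOrReplaceTempView("users")
spark.sql("SELECT * FROM users WHERE NOT (email <=> 'alice@example.com')")
```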

Quick Reference

Expression             Non-null result   Null result
col === lit("x")       true or false     null
col =!= lit("x")       true or false     null
col.isNull             false             true
col.isNotNull          true              false
col <=> lit("x")       true or false     false
!(col <=> lit("x"))    true or false     true

The rule of thumb: never use === or =!= to detect missing values. Always use isNull or isNotNull. If you need to filter on a value while preserving nulls, reach for <=>.

For more on working with null values in Spark, the coalesce function is useful for replacing nulls with fallback values.
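As a sketch of that pattern (the fallback address here is just an illustration):

```scala
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Replace missing emails with a fallback value; coalesce returns
// the first non-null argument for each row.
val filled = df.withColumn(
  "email",
  coalesce(col("email"), lit("no-email@example.com"))
)
```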

Tutorial Details

Created: 2026-03-19 01:43:53 AM

Last Updated: 2026-03-19 01:43:53 AM