Why === and Not == for Column Equality in Spark Scala
Everyone coming to Spark from Scala (or from Pandas) tries col("x") == "value" at least once. It looks right, it sometimes even compiles, and then nothing matches. Spark uses === for column equality — not == — and the reason traces back to a hard limit in the Scala language itself.
What Each Operator Returns
The clearest way to see the difference is to print what each operator gives you back:
import org.apache.spark.sql.functions.col
val statusColumn = col("status")
val tripleEquals = statusColumn === "active"
println(s"=== returns: $tripleEquals")
println(s" class: ${tripleEquals.getClass.getName}")
val doubleEquals = statusColumn == "active"
println(s"== returns: $doubleEquals")
println(s" class: ${doubleEquals.getClass.getName}")
// === returns: (status = active)
// class: org.apache.spark.sql.Column
// == returns: false
// class: boolean
=== produces a Column expression — the unevaluated predicate (status = active) that Spark will run against each row at query time. == produces a plain Scala Boolean. Specifically, it returns false, because a Column object is not equal to the string "active" as far as JVM object equality is concerned.
That Boolean has no place in a query plan. It's just a value the JVM computed on the driver before Spark ever saw it.
Why Spark Couldn't Just Overload ==
In Scala, == is final on Any — it always calls .equals(other) under the hood and always returns Boolean. There's no way for a library to change that. Spark needs the equality operator to return a Column (an expression tree the Catalyst optimizer can compile and push down), so it had to pick a different symbol. === is a regular method on Column, free to return whatever Spark wants — in this case, a new Column representing the comparison.
This is the same reason not-equal is spelled =!=: != is final on Any too, so Spark can't repurpose it. Null-safe equality, <=>, has no built-in Scala counterpart at all; it is simply another regular method on Column.
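To see the language constraint without Spark in the picture, here is a minimal sketch in plain Scala. ToyColumn and ToyColumnDemo are hypothetical names invented for this illustration, not Spark's Column; the sketch only shows that a class can give === any return type it likes, while == with the same signature as the one on Any stays pinned to Boolean.

```scala
// ToyColumn is a hypothetical stand-in for Spark's Column, for illustration only.
class ToyColumn(val expr: String) {
  // A regular method: free to return a new expression instead of a Boolean.
  def ===(other: Any): ToyColumn = new ToyColumn(s"($expr = $other)")

  // Uncommenting the next line fails to compile, because == is final on Any:
  // def ==(other: Any): ToyColumn = new ToyColumn(s"($expr = $other)")

  override def toString: String = expr
}

object ToyColumnDemo {
  val status = new ToyColumn("status")

  // Builds a description of the comparison; nothing is compared yet.
  val predicate: ToyColumn = status === "active"

  // Calls equals (reference equality here) and is evaluated immediately.
  val eager: Boolean = status == "active"

  def main(args: Array[String]): Unit = {
    println(predicate) // (status = active)
    println(eager)     // false
  }
}
```

The compiler may warn that the == comparison is between unrelated types, which is a useful hint that the line does not do what it appears to.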
Filtering: The Most Common Trap
This is where == bites hardest. You want to filter rows where a column matches a value:
// toDF needs import spark.implicits._ (assumes a SparkSession named spark, as in spark-shell)
val df = Seq(
("Alice", "active"),
("Bob", "inactive"),
("Charlie", "active"),
("Diana", "pending"),
("Eve", "active"),
).toDF("name", "status")
val result = df.filter(col("status") === "active")
result.show(false)
// +-------+------+
// |name |status|
// +-------+------+
// |Alice |active|
// |Charlie|active|
// |Eve |active|
// +-------+------+
With ===, filter receives a Column expression and Spark evaluates it row by row. Three rows match.
Swap === for == and you're feeding filter a Boolean. In most cases the Scala compiler catches this: Dataset.filter has no overload that accepts a bare Boolean, so you'll get a compile error before the code ever runs. That's the lucky outcome. The unlucky outcome is when surrounding code turns the Boolean back into a Column (by wrapping it in lit(...), for example, or interpolating it into an expr string) and the comparison silently becomes a constant false baked into the query plan, dropping every row.
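One way to reach the unlucky outcome is lit, which accepts any value, including a Boolean. The sketch below is illustrative rather than definitive: it assumes a throwaway local SparkSession, and the object and method names (LitTrapDemo, matchCounts) are made up for this example.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

object LitTrapDemo {
  // Returns (rows kept by ===, rows kept by lit(... == ...)).
  def matchCounts(spark: SparkSession): (Long, Long) = {
    import spark.implicits._

    val df = Seq(
      ("Alice", "active"),
      ("Bob", "inactive"),
      ("Charlie", "active"),
      ("Diana", "pending"),
      ("Eve", "active")
    ).toDF("name", "status")

    // Column expression: Spark evaluates it against every row.
    val withTripleEquals = df.filter(col("status") === "active").count()

    // == runs on the driver first and yields false,
    // so the filter predicate is the constant lit(false).
    val withLitOfDoubleEquals = df.filter(lit(col("status") == "active")).count()

    (withTripleEquals, withLitOfDoubleEquals)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local single-threaded session, for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("lit-trap-demo").getOrCreate()
    try println(matchCounts(spark)) // (3,0)
    finally spark.stop()
  }
}
```

The second filter compiles and runs without complaint, which is exactly what makes it dangerous.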
Comparing Two Columns
The same rule applies when comparing columns against each other. Use ===:
val df = Seq(
("Alice", "Alice"),
("Bob", "Robert"),
("Charlie", "Charlie"),
("Diana", "Di"),
("Eve", "Eve"),
).toDF("legal_name", "preferred_name")
val result = df.withColumn(
"uses_legal_name",
col("legal_name") === col("preferred_name"),
)
result.show(false)
// +----------+--------------+---------------+
// |legal_name|preferred_name|uses_legal_name|
// +----------+--------------+---------------+
// |Alice |Alice |true |
// |Bob |Robert |false |
// |Charlie |Charlie |true |
// |Diana |Di |false |
// |Eve |Eve |true |
// +----------+--------------+---------------+
The uses_legal_name column is computed per row by comparing the two column values. Each cell is a real true or false derived from the data.
The Confusing Edge Case: == Between Two Columns
Here's where intuition really breaks down. col("status") == col("status") doesn't return false — it returns true. Not at the row level. On the driver, before any data is touched.
val sameRef = col("status")
val objectEqualsSame = sameRef == sameRef
val objectEqualsTwoCols = col("status") == col("status")
val objectEqualsDifferent = col("status") == col("name")
println(s"sameRef == sameRef: $objectEqualsSame")
println(s"col(\"status\") == col(\"status\"): $objectEqualsTwoCols")
println(s"col(\"status\") == col(\"name\"): $objectEqualsDifferent")
// sameRef == sameRef: true
// col("status") == col("status"): true
// col("status") == col("name"): false
Spark's Column class overrides .equals to compare the underlying expression trees. Two columns built from col("status") produce the same expression, so == returns true. Two columns built from different names produce different expressions, so == returns false.
This isn't row-level comparison. It's a structural comparison of the two expressions, computed once on the driver. If you wrote df.filter(col("status") == col("status")) and the compiler somehow accepted it, either every row would pass or every row would fail, depending on that one driver-side Boolean and never on the actual values in the data.
The Other Comparison Operators
The same === / == distinction extends across the whole comparison family:
| You probably meant | Use this | Don't use |
|---|---|---|
| Equal to | === | == |
| Not equal to | =!= | != |
| Null-safe equal to | <=> | (no plain equivalent) |
| Greater than / less than | > < >= <= | (these work; they're not on Any) |
The arithmetic and ordering operators (>, <, +, -, etc.) aren't reserved by Any, so Spark defines them directly on Column and they behave the way you'd expect. The trap is specifically equality, and specifically the operators Scala has already nailed down.
Two related rabbit holes worth knowing about: === and =!= both evaluate to null, not true or false, when either side is null (which is why <=> exists), and the when function uses these same column expressions to build conditional logic.
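The null behaviour is easiest to see side by side. The following is a small sketch, assuming a local SparkSession; the object name NullEqualityDemo, the method compare, and the column names x and y are invented for the example. None in the result stands for SQL null.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object NullEqualityDemo {
  // Returns one (===, =!=, <=>) triple per input row; None represents SQL null.
  def compare(spark: SparkSession): Seq[(Option[Boolean], Option[Boolean], Option[Boolean])] = {
    import spark.implicits._

    val df = Seq[(Option[String], Option[String])](
      (Some("a"), Some("a")),
      (Some("a"), None),
      (None, None)
    ).toDF("x", "y")

    df.select(
      (col("x") === col("y")).as("eq"),
      (col("x") =!= col("y")).as("neq"),
      (col("x") <=> col("y")).as("null_safe_eq")
    ).as[(Option[Boolean], Option[Boolean], Option[Boolean])].collect().toSeq
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local single-threaded session, for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("null-equality-demo").getOrCreate()
    try {
      // ("a", "a")   -> eq = true, neq = false, null_safe_eq = true
      // ("a", null)  -> eq = null, neq = null,  null_safe_eq = false
      // (null, null) -> eq = null, neq = null,  null_safe_eq = true
      compare(spark).foreach(println)
    } finally spark.stop()
  }
}
```

Because filter keeps only rows where the predicate is true, a filter on x =!= y would drop the ("a", null) row here even though the two values plainly differ.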
The Rule
Reach for === any time you're comparing column values. If you find yourself typing == between anything involving a Column, stop — either the compiler is about to complain, or worse, it won't and the result will be a driver-side Boolean that has nothing to do with your data.