Spark Scala Count and Count Distinct

count, countDistinct, and count_if are aggregate functions for counting rows in a Spark Scala DataFrame. count counts rows or non-null values, countDistinct counts unique values, and count_if counts rows that match a condition.

Counting rows vs counting non-null values

The most common counting function is count. It has two forms — one that takes a column name as a string, and one that takes a Column:

def count(columnName: String): TypedColumn[Any, Long]

def count(e: Column): Column

The key thing to know about count is that count("*") counts every row, while count(col("amount")) counts only rows where amount is non-null. This is the same behavior as SQL's COUNT(*) vs COUNT(column).

val df = Seq(
  ("Alice",   "US",     Some(120.50)),
  ("Bob",     "US",     Some(80.00)),
  ("Carol",   "UK",     None),
  ("Dave",    "UK",     Some(45.25)),
  ("Eve",     "US",     None),
  ("Frank",   "CA",     Some(210.00)),
).toDF("customer", "region", "amount")

val df2 = df
  .groupBy("region")
  .agg(
    count("*").as("total_rows"),
    count(col("amount")).as("amount_not_null"),
  )
  .orderBy("region")

df2.show(false)
// +------+----------+---------------+
// |region|total_rows|amount_not_null|
// +------+----------+---------------+
// |CA    |1         |1              |
// |UK    |2         |1              |
// |US    |3         |2              |
// +------+----------+---------------+

In the US group there are three rows, but only two have a non-null amount. In the UK group, only one of the two rows has an amount. This null-skipping behavior is easy to miss and is a common source of off-by-one bugs in reports.

Counting distinct values

To count unique values in a column, use countDistinct. The function is defined as:

def countDistinct(expr: Column, exprs: Column*): Column

def countDistinct(columnName: String, columnNames: String*): Column

Spark also exposes a snake_case alias, count_distinct, that behaves identically. It was added in Spark 3.2.0:

def count_distinct(expr: Column, exprs: Column*): Column

val df = Seq(
  ("Alice",   "Widget"),
  ("Alice",   "Gadget"),
  ("Bob",     "Widget"),
  ("Carol",   "Widget"),
  ("Carol",   "Gadget"),
  ("Carol",   "Sprocket"),
  ("Dave",    "Widget"),
).toDF("customer", "product")

val df2 = df
  .groupBy("customer")
  .agg(
    count("*").as("orders"),
    countDistinct(col("product")).as("unique_products"),
  )
  .orderBy("customer")

df2.show(false)
// +--------+------+---------------+
// |customer|orders|unique_products|
// +--------+------+---------------+
// |Alice   |2     |2              |
// |Bob     |1     |1              |
// |Carol   |3     |3              |
// |Dave    |1     |1              |
// +--------+------+---------------+

Carol has three orders, but they happen to be for three different products, so unique_products matches orders in this case. If Carol had ordered the same widget twice, orders would be 3 and unique_products would be 2.

Counting distinct combinations across multiple columns

countDistinct and count_distinct accept multiple columns. Spark counts distinct tuples — combinations of values — rather than counting each column independently:

val df = Seq(
  ("Alice", "NYC", "Widget"),
  ("Alice", "NYC", "Widget"),
  ("Alice", "LA",  "Widget"),
  ("Alice", "LA",  "Gadget"),
  ("Bob",   "NYC", "Widget"),
  ("Bob",   "NYC", "Widget"),
).toDF("customer", "city", "product")

val df2 = df
  .groupBy("customer")
  .agg(
    count("*").as("rows"),
    count_distinct(col("city"), col("product")).as("distinct_city_product"),
  )
  .orderBy("customer")

df2.show(false)
// +--------+----+---------------------+
// |customer|rows|distinct_city_product|
// +--------+----+---------------------+
// |Alice   |4   |3                    |
// |Bob     |2   |1                    |
// +--------+----+---------------------+

Alice has four rows but only three distinct (city, product) pairs: (NYC, Widget), (LA, Widget), and (LA, Gadget). The duplicate (NYC, Widget) row collapses. Bob's two rows are identical, so they count as one distinct pair.

Distinct counts are expensive to compute on large datasets because Spark needs to deduplicate values across all partitions. For approximate counts that are much cheaper at scale, see approx_count_distinct.

Counting rows that match a condition

count_if counts the rows where a boolean expression is true. It's a Spark SQL function — it isn't available in the Scala functions object, so you call it through expr():

count_if(expr) — via expr()

The count_if function first appeared in version 3.0.0.

val df = Seq(
  ("Alice",   "US",  120.50),
  ("Bob",     "US",  80.00),
  ("Carol",   "US",  45.25),
  ("Dave",    "UK",  210.00),
  ("Eve",     "UK",  15.00),
  ("Frank",   "UK",  300.00),
  ("Grace",   "CA",  55.00),
).toDF("customer", "region", "amount")

val df2 = df
  .groupBy("region")
  .agg(
    count("*").as("orders"),
    expr("count_if(amount > 100)").as("big_orders"),
  )
  .orderBy("region")

df2.show(false)
// +------+------+----------+
// |region|orders|big_orders|
// +------+------+----------+
// |CA    |1     |0         |
// |UK    |3     |2         |
// |US    |3     |1         |
// +------+------+----------+

The same result can be achieved without count_if using sum(when(condition, 1).otherwise(0)) or count(when(condition, true)), but count_if reads more cleanly when you just need to count matches against a predicate. Rolling several conditional counts into a single agg call is also the fix for the filtered .count() triggers a full scan anti-pattern, where repeated df.filter(...).count() calls re-scan the source for every action.

For summing values in a group, see sum. For conditional logic inside aggregations, see when. For approximate distinct counts on very large datasets, see approx_count_distinct.

Spark Scala Count and Count Distinct

Counting rows vs counting non-null values

Counting distinct values

Counting distinct combinations across multiple columns

Counting rows that match a condition

Related functions