Spark Scala Sum

sum is an aggregate function that totals the values in a numeric column. It's one of the most common operations in Spark Scala — used to roll up sales totals, computed metrics, running balances, and just about any group-level numeric summary.

Adding values within a group

sum has two overloads — one that takes a column name as a string, and one that takes a Column:

def sum(columnName: String): Column

def sum(e: Column): Column

Both return the total of the column's values within each group. The Column form is more flexible because you can pass an expression — not just a bare column reference.

val df = Seq(
  ("Alice",   "US",  120.50),
  ("Bob",     "US",  80.00),
  ("Carol",   "UK",  45.25),
  ("Dave",    "UK",  210.00),
  ("Eve",     "US",  15.00),
  ("Frank",   "CA",  300.00),
  ("Grace",   "CA",  55.00),
).toDF("customer", "region", "amount")

val df2 = df
  .groupBy("region")
  .agg(sum(col("amount")).as("total_sales"))
  .orderBy("region")

df2.show(false)
// +------+-----------+
// |region|total_sales|
// +------+-----------+
// |CA    |355.0      |
// |UK    |255.25     |
// |US    |215.5      |
// +------+-----------+

The result type follows from the input type. Summing a Double column produces a Double. Summing an integer column produces a Long, which avoids overflow when totaling many large integers.

How sum handles nulls

sum skips null values silently. This is usually what you want, but it has one surprising edge case: if every value in a group is null, the result for that group is null — not 0.

val df = Seq(
  ("Alice",   "US",  Some(120.50)),
  ("Bob",     "US",  Some(80.00)),
  ("Carol",   "UK",  None),
  ("Dave",    "UK",  Some(210.00)),
  ("Eve",     "US",  None),
  ("Frank",   "CA",  None),
).toDF("customer", "region", "amount")

val df2 = df
  .groupBy("region")
  .agg(
    count("*").as("orders"),
    sum(col("amount")).as("total_sales"),
  )
  .orderBy("region")

df2.show(false)
// +------+------+-----------+
// |region|orders|total_sales|
// +------+------+-----------+
// |CA    |1     |null       |
// |UK    |2     |210.0      |
// |US    |3     |200.5      |
// +------+------+-----------+

Frank in CA is the only customer in that region and his amount is null, so total_sales comes back as null instead of 0.0. If you want a zero in that case, wrap the result with coalesce: coalesce(sum(col("amount")), lit(0.0)).

Totaling across the whole DataFrame

sum can also be used without groupBy to compute totals across all rows. The Column overload lets you sum an expression — useful for derived metrics like tax or commission:

val df = Seq(
  ("Alice",   "US",  120.50),
  ("Bob",     "US",  80.00),
  ("Carol",   "UK",  45.25),
  ("Dave",    "UK",  210.00),
  ("Eve",     "US",  15.00),
).toDF("customer", "region", "amount")

val df2 = df.agg(
  sum("amount").as("grand_total"),
  sum(col("amount") * 0.10).as("tax_total"),
)

df2.show(false)
// +-----------+---------+
// |grand_total|tax_total|
// +-----------+---------+
// |470.75     |47.075   |
// +-----------+---------+

Calling .agg() directly on the DataFrame without groupBy aggregates everything into a single row. The tax_total column shows the power of the Column overload: Spark evaluates amount * 0.10 for each row before summing, with no extra withColumn step.

Running totals with a window

sum works as a window function too. Pair it with a Window spec to compute running totals, cumulative balances, or moving sums:

import org.apache.spark.sql.expressions.Window

val df = Seq(
  ("Alice", "2026-01-01", 120.50),
  ("Alice", "2026-01-02", 80.00),
  ("Alice", "2026-01-03", 45.25),
  ("Bob",   "2026-01-01", 210.00),
  ("Bob",   "2026-01-02", 15.00),
  ("Bob",   "2026-01-03", 300.00),
).toDF("customer", "date", "amount")

val window = Window
  .partitionBy("customer")
  .orderBy("date")

val df2 = df
  .withColumn("running_total", sum(col("amount")).over(window))

df2.show(false)
// +--------+----------+------+-------------+
// |customer|date      |amount|running_total|
// +--------+----------+------+-------------+
// |Alice   |2026-01-01|120.5 |120.5        |
// |Alice   |2026-01-02|80.0  |200.5        |
// |Alice   |2026-01-03|45.25 |245.75       |
// |Bob     |2026-01-01|210.0 |210.0        |
// |Bob     |2026-01-02|15.0  |225.0        |
// |Bob     |2026-01-03|300.0 |525.0        |
// +--------+----------+------+-------------+

Each customer gets their own running total because the window is partitioned by customer and ordered by date. The first row in each partition equals the row's amount, and each subsequent row adds the current amount to the accumulated total.

For row counts within a group, see count and countDistinct. For conditional logic inside an aggregation, see when and otherwise. For replacing null aggregate results with a default value, see coalesce.

Spark Scala Sum

Adding values within a group

How sum handles nulls

Totaling across the whole DataFrame

Running totals with a window

Related Functions