Spark Scala Sum
sum is an aggregate function that totals the values in a numeric column. It's one of the most common operations in Spark Scala — used to roll up sales totals, computed metrics, running balances, and just about any group-level numeric summary.
Adding values within a group
sum has two overloads — one that takes a column name as a string, and one that takes a Column:
def sum(columnName: String): Column
def sum(e: Column): Column
Both return the total of the column's values within each group. The Column form is more flexible because you can pass an expression — not just a bare column reference.
val df = Seq(
("Alice", "US", 120.50),
("Bob", "US", 80.00),
("Carol", "UK", 45.25),
("Dave", "UK", 210.00),
("Eve", "US", 15.00),
("Frank", "CA", 300.00),
("Grace", "CA", 55.00),
).toDF("customer", "region", "amount")
val df2 = df
.groupBy("region")
.agg(sum(col("amount")).as("total_sales"))
.orderBy("region")
df2.show(false)
// +------+-----------+
// |region|total_sales|
// +------+-----------+
// |CA |355.0 |
// |UK |255.25 |
// |US |215.5 |
// +------+-----------+
The result type follows from the input type. Summing a Double column produces a Double. Summing an integer column produces a Long, which avoids overflow when totaling many large integers.
How sum handles nulls
sum skips null values silently. This is usually what you want, but it has one surprising edge case: if every value in a group is null, the result for that group is null — not 0.
val df = Seq(
("Alice", "US", Some(120.50)),
("Bob", "US", Some(80.00)),
("Carol", "UK", None),
("Dave", "UK", Some(210.00)),
("Eve", "US", None),
("Frank", "CA", None),
).toDF("customer", "region", "amount")
val df2 = df
.groupBy("region")
.agg(
count("*").as("orders"),
sum(col("amount")).as("total_sales"),
)
.orderBy("region")
df2.show(false)
// +------+------+-----------+
// |region|orders|total_sales|
// +------+------+-----------+
// |CA |1 |null |
// |UK |2 |210.0 |
// |US |3 |200.5 |
// +------+------+-----------+
Frank in CA is the only customer in that region and his amount is null, so total_sales comes back as null instead of 0.0. If you want a zero in that case, wrap the result with coalesce: coalesce(sum(col("amount")), lit(0.0)).
Totaling across the whole DataFrame
sum can also be used without groupBy to compute totals across all rows. The Column overload lets you sum an expression — useful for derived metrics like tax or commission:
val df = Seq(
("Alice", "US", 120.50),
("Bob", "US", 80.00),
("Carol", "UK", 45.25),
("Dave", "UK", 210.00),
("Eve", "US", 15.00),
).toDF("customer", "region", "amount")
val df2 = df.agg(
sum("amount").as("grand_total"),
sum(col("amount") * 0.10).as("tax_total"),
)
df2.show(false)
// +-----------+---------+
// |grand_total|tax_total|
// +-----------+---------+
// |470.75 |47.075 |
// +-----------+---------+
Calling .agg() directly on the DataFrame without groupBy aggregates everything into a single row. The tax_total column shows the power of the Column overload: Spark evaluates amount * 0.10 for each row before summing, with no extra withColumn step.
Running totals with a window
sum works as a window function too. Pair it with a Window spec to compute running totals, cumulative balances, or moving sums:
import org.apache.spark.sql.expressions.Window
val df = Seq(
("Alice", "2026-01-01", 120.50),
("Alice", "2026-01-02", 80.00),
("Alice", "2026-01-03", 45.25),
("Bob", "2026-01-01", 210.00),
("Bob", "2026-01-02", 15.00),
("Bob", "2026-01-03", 300.00),
).toDF("customer", "date", "amount")
val window = Window
.partitionBy("customer")
.orderBy("date")
val df2 = df
.withColumn("running_total", sum(col("amount")).over(window))
df2.show(false)
// +--------+----------+------+-------------+
// |customer|date |amount|running_total|
// +--------+----------+------+-------------+
// |Alice |2026-01-01|120.5 |120.5 |
// |Alice |2026-01-02|80.0 |200.5 |
// |Alice |2026-01-03|45.25 |245.75 |
// |Bob |2026-01-01|210.0 |210.0 |
// |Bob |2026-01-02|15.0 |225.0 |
// |Bob |2026-01-03|300.0 |525.0 |
// +--------+----------+------+-------------+
Each customer gets their own running total because the window is partitioned by customer and ordered by date. The first row in each partition equals the row's amount, and each subsequent row adds the current amount to the accumulated total.
Related Functions
For row counts within a group, see count and countDistinct. For conditional logic inside an aggregation, see when and otherwise. For replacing null aggregate results with a default value, see coalesce.