Spark Scala Avg and Mean
avg and mean are aggregate functions that compute the arithmetic mean of values in a numeric column. They're the workhorse functions for summarizing data — average salary by department, average order size by region, moving average of a stock price. The two functions are identical; mean is just an alias for avg.
Averaging values within a group
avg has two overloads — one that takes a column name as a string, and one that takes a Column:
def avg(columnName: String): Column
def avg(e: Column): Column
Both return the arithmetic mean of the column's values within each group. The Column form is more flexible because you can pass an expression — not just a bare column reference.
val df = Seq(
("Alice", "Engineering", 120000.0),
("Bob", "Engineering", 95000.0),
("Carol", "Sales", 72000.0),
("Dave", "Sales", 88000.0),
("Eve", "Engineering", 110000.0),
("Frank", "Marketing", 65000.0),
("Grace", "Marketing", 78000.0),
).toDF("employee", "department", "salary")
val df2 = df
.groupBy("department")
.agg(avg(col("salary")).as("avg_salary"))
.orderBy("department")
df2.show(false)
// +-----------+------------------+
// |department |avg_salary |
// +-----------+------------------+
// |Engineering|108333.33333333333|
// |Marketing |71500.0 |
// |Sales |80000.0 |
// +-----------+------------------+
The result is always a Double, even when the input column is an integer type. This is because the average of integers is rarely an integer itself — Spark promotes the result rather than losing precision through truncation.
avg vs mean
mean is a direct alias for avg. Same signatures, same behavior, same result. Use whichever name reads better in the context of your code:
def mean(columnName: String): Column
def mean(e: Column): Column
val df = Seq(
("Alice", "Engineering", 120000.0),
("Bob", "Engineering", 95000.0),
("Carol", "Sales", 72000.0),
("Dave", "Sales", 88000.0),
("Eve", "Engineering", 110000.0),
).toDF("employee", "department", "salary")
val df2 = df
.groupBy("department")
.agg(
avg(col("salary")).as("avg_salary"),
mean(col("salary")).as("mean_salary"),
)
.orderBy("department")
df2.show(false)
// +-----------+------------------+------------------+
// |department |avg_salary |mean_salary |
// +-----------+------------------+------------------+
// |Engineering|108333.33333333333|108333.33333333333|
// |Sales |80000.0 |80000.0 |
// +-----------+------------------+------------------+
Both columns are identical. mean exists mostly for readability — code that reads "mean income by region" can feel more natural than "avg income by region" depending on the domain.
How avg handles nulls
avg skips null values silently. Both the numerator (sum) and the denominator (count) ignore nulls, so the result is the average of the non-null values in each group. This has one surprising edge case: if every value in a group is null, the result for that group is null — not 0.
val df = Seq(
("Alice", "Engineering", Some(120000.0)),
("Bob", "Engineering", Some(95000.0)),
("Carol", "Sales", None),
("Dave", "Sales", Some(88000.0)),
("Eve", "Engineering", None),
("Frank", "Marketing", None),
).toDF("employee", "department", "salary")
val df2 = df
.groupBy("department")
.agg(
count("*").as("headcount"),
avg(col("salary")).as("avg_salary"),
)
.orderBy("department")
df2.show(false)
// +-----------+---------+----------+
// |department |headcount|avg_salary|
// +-----------+---------+----------+
// |Engineering|3 |107500.0 |
// |Marketing |1 |null |
// |Sales |2 |88000.0 |
// +-----------+---------+----------+
In Engineering, Eve's null salary is excluded — the average is (120000 + 95000) / 2 = 107500, not divided by 3. Frank in Marketing is the only employee in that group and his salary is null, so avg_salary comes back as null instead of 0.0. If you want a default like 0 in that case, wrap the result with coalesce: coalesce(avg(col("salary")), lit(0.0)).
Moving averages with a window
avg works as a window function too. Pair it with a Window spec and a frame to compute moving averages — a common pattern for smoothing time series like stock prices, sensor readings, or daily traffic:
import org.apache.spark.sql.expressions.Window
val df = Seq(
("AAPL", "2026-01-01", 185.50),
("AAPL", "2026-01-02", 188.20),
("AAPL", "2026-01-03", 190.10),
("AAPL", "2026-01-04", 187.80),
("AAPL", "2026-01-05", 192.40),
("GOOG", "2026-01-01", 140.10),
("GOOG", "2026-01-02", 142.50),
("GOOG", "2026-01-03", 141.80),
("GOOG", "2026-01-04", 144.20),
("GOOG", "2026-01-05", 145.00),
).toDF("ticker", "date", "price")
val window = Window
.partitionBy("ticker")
.orderBy("date")
.rowsBetween(-2, 0)
val df2 = df
.withColumn("moving_avg", avg(col("price")).over(window))
df2.show(false)
// +------+----------+-----+------------------+
// |ticker|date |price|moving_avg |
// +------+----------+-----+------------------+
// |AAPL |2026-01-01|185.5|185.5 |
// |AAPL |2026-01-02|188.2|186.85 |
// |AAPL |2026-01-03|190.1|187.9333333333333 |
// |AAPL |2026-01-04|187.8|188.69999999999996|
// |AAPL |2026-01-05|192.4|190.1 |
// |GOOG |2026-01-01|140.1|140.1 |
// |GOOG |2026-01-02|142.5|141.3 |
// |GOOG |2026-01-03|141.8|141.46666666666667|
// |GOOG |2026-01-04|144.2|142.83333333333334|
// |GOOG |2026-01-05|145.0|143.66666666666666|
// +------+----------+-----+------------------+
The rowsBetween(-2, 0) frame includes the current row and the two rows preceding it — a 3-day moving average. Each ticker is averaged independently because the window is partitioned by ticker. The first row in each partition averages just itself, the second averages two rows, and from the third row onward you get the full 3-row average.
Related Functions
For totaling values rather than averaging them, see sum. For row counts within a group, see count and countDistinct. For replacing null aggregate results with a default value, see coalesce.