Job Board
Consulting

Spark Scala Min and Max

min and max are aggregate functions that return the smallest and largest values in a column. min_by and max_by go a step further — they return the value of one column at the row where another column reaches its extreme, so you can answer questions like "which employee earns the highest salary?" rather than just "what is the highest salary?".

Finding the smallest and largest values in a group

min and max each have two overloads — one that takes a column name as a string, and one that takes a Column:

def min(columnName: String): Column

def min(e: Column): Column

def max(columnName: String): Column

def max(e: Column): Column

The Column form is more flexible because you can pass an expression — not just a bare column reference.

val df = Seq(
  ("Alice",   "Engineering",  120000.0),
  ("Bob",     "Engineering",  95000.0),
  ("Carol",   "Sales",        72000.0),
  ("Dave",    "Sales",        88000.0),
  ("Eve",     "Engineering",  110000.0),
  ("Frank",   "Marketing",    65000.0),
  ("Grace",   "Marketing",    78000.0),
).toDF("employee", "department", "salary")

val df2 = df
  .groupBy("department")
  .agg(
    min(col("salary")).as("min_salary"),
    max(col("salary")).as("max_salary"),
  )
  .orderBy("department")

df2.show(false)
// +-----------+----------+----------+
// |department |min_salary|max_salary|
// +-----------+----------+----------+
// |Engineering|95000.0   |120000.0  |
// |Marketing  |65000.0   |78000.0   |
// |Sales      |72000.0   |88000.0   |
// +-----------+----------+----------+

min and max preserve the input type — summing salaries with sum may promote integers to longs, but min and max of a Double column return a Double, and of a String column return a String (compared lexicographically). They work on dates, timestamps, and any other ordered type as well.

How min and max handle nulls

Both functions skip null values silently. The result is the smallest or largest of the non-null values in each group. As with most Spark aggregates, if every value in a group is null, the result for that group is null.

val df = Seq(
  ("Alice",   "Engineering",  Some(120000.0)),
  ("Bob",     "Engineering",  Some(95000.0)),
  ("Carol",   "Sales",        None),
  ("Dave",    "Sales",        Some(88000.0)),
  ("Eve",     "Engineering",  None),
  ("Frank",   "Marketing",    None),
).toDF("employee", "department", "salary")

val df2 = df
  .groupBy("department")
  .agg(
    count("*").as("headcount"),
    min(col("salary")).as("min_salary"),
    max(col("salary")).as("max_salary"),
  )
  .orderBy("department")

df2.show(false)
// +-----------+---------+----------+----------+
// |department |headcount|min_salary|max_salary|
// +-----------+---------+----------+----------+
// |Engineering|3        |95000.0   |120000.0  |
// |Marketing  |1        |null      |null      |
// |Sales      |2        |88000.0   |88000.0   |
// +-----------+---------+----------+----------+

In Engineering, Eve's null salary is excluded, so the extremes come from Alice and Bob. Sales has only one non-null value (Dave at 88000), so min_salary and max_salary are both that single value. Frank in Marketing is the only employee in that group and his salary is null, so both columns come back as null. If you'd rather see a default like 0, wrap the result with coalesce: coalesce(min(col("salary")), lit(0.0)).

Finding the row associated with the extreme value

Knowing the highest salary is one thing; knowing who earns it is another. min_by and max_by return the value of one column at the row where another column reaches its minimum or maximum:

The min_by and max_by functions first appeared in version 3.3.0 and are defined as:

def min_by(e: Column, ord: Column): Column

def max_by(e: Column, ord: Column): Column

The first parameter is the column whose value you want back. The second is the column to rank by. For each group, Spark finds the row with the smallest or largest ord value and returns the corresponding e value.

val df = Seq(
  ("Alice",   "Engineering",  120000.0),
  ("Bob",     "Engineering",  95000.0),
  ("Carol",   "Sales",        72000.0),
  ("Dave",    "Sales",        88000.0),
  ("Eve",     "Engineering",  110000.0),
  ("Frank",   "Marketing",    65000.0),
  ("Grace",   "Marketing",    78000.0),
).toDF("employee", "department", "salary")

val df2 = df
  .groupBy("department")
  .agg(
    min_by(col("employee"), col("salary")).as("lowest_paid"),
    max_by(col("employee"), col("salary")).as("highest_paid"),
  )
  .orderBy("department")

df2.show(false)
// +-----------+-----------+------------+
// |department |lowest_paid|highest_paid|
// +-----------+-----------+------------+
// |Engineering|Bob        |Alice       |
// |Marketing  |Frank      |Grace       |
// |Sales      |Carol      |Dave        |
// +-----------+-----------+------------+

In Engineering, the maximum salary is 120000 — that row's employee value is Alice, so max_by returns "Alice". The same idea applies across all the columns: min_by and max_by are how you do "argmin" and "argmax" without resorting to a join or a window function. If multiple rows tie for the extreme, Spark returns the value from one of them — the choice is not guaranteed to be stable.

Rolling highs and lows with a window

min and max work as window functions too. Pair them with a Window spec and a frame to compute rolling minimums and maximums — useful for things like stock highs and lows, sensor envelopes, or detecting breakouts in time-series data:

import org.apache.spark.sql.expressions.Window

val df = Seq(
  ("AAPL", "2026-01-01", 185.50),
  ("AAPL", "2026-01-02", 188.20),
  ("AAPL", "2026-01-03", 190.10),
  ("AAPL", "2026-01-04", 187.80),
  ("AAPL", "2026-01-05", 192.40),
  ("GOOG", "2026-01-01", 140.10),
  ("GOOG", "2026-01-02", 142.50),
  ("GOOG", "2026-01-03", 141.80),
  ("GOOG", "2026-01-04", 144.20),
  ("GOOG", "2026-01-05", 145.00),
).toDF("ticker", "date", "price")

val window = Window
  .partitionBy("ticker")
  .orderBy("date")
  .rowsBetween(-2, 0)

val df2 = df
  .withColumn("rolling_low", min(col("price")).over(window))
  .withColumn("rolling_high", max(col("price")).over(window))

df2.show(false)
// +------+----------+-----+-----------+------------+
// |ticker|date      |price|rolling_low|rolling_high|
// +------+----------+-----+-----------+------------+
// |AAPL  |2026-01-01|185.5|185.5      |185.5       |
// |AAPL  |2026-01-02|188.2|185.5      |188.2       |
// |AAPL  |2026-01-03|190.1|185.5      |190.1       |
// |AAPL  |2026-01-04|187.8|187.8      |190.1       |
// |AAPL  |2026-01-05|192.4|187.8      |192.4       |
// |GOOG  |2026-01-01|140.1|140.1      |140.1       |
// |GOOG  |2026-01-02|142.5|140.1      |142.5       |
// |GOOG  |2026-01-03|141.8|140.1      |142.5       |
// |GOOG  |2026-01-04|144.2|141.8      |144.2       |
// |GOOG  |2026-01-05|145.0|141.8      |145.0       |
// +------+----------+-----+-----------+------------+

The rowsBetween(-2, 0) frame includes the current row and the two rows preceding it — a 3-day rolling window. AAPL's row on 2026-01-04 shows a rolling_low of 187.8 because the previous two-day window (190.1, 187.8) doesn't include the earlier 185.5 anymore. Each ticker is computed independently because the window is partitioned by ticker.

For totaling values within a group, see sum. For arithmetic means, see avg and mean. For row counts, see count and countDistinct. For replacing null aggregate results with a default value, see coalesce.

Example Details

Created: 2026-06-09 10:40:27 PM

Last Updated: 2026-06-09 10:40:27 PM