The Reference You Need
Spark Scala Examples
Simple Spark Scala examples to help you quickly complete your data ETL pipelines. Save time digging through the Spark Scala function API and get right to the code you need.
-
first, last, first_value, last_value, any_value in Spark Scala: Picking Values from a Group
first and last are aggregate functions that return the first or last value they encounter in a group, which is the earliest or latest value only when the data is ordered. They're most useful as window functions paired with an orderBy clause, where "first" and "last" actually mean something specific. first_value and last_value are SQL-only synonyms, and any_value returns an arbitrary value when you don't care which one you get.
-
min, max, min_by, max_by in Spark Scala: Find Extremes in a DataFrame
min and max are aggregate functions that return the smallest and largest values in a column. min_by and max_by go a step further — they return the value of one column at the row where another column reaches its extreme, so you can answer questions like "which employee earns the highest salary?" rather than just "what is the highest salary?".
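To make the difference concrete, here is a minimal plain-Scala sketch of the idea using ordinary collections rather than the Spark functions themselves. The Employee class and the sample rows are invented for illustration, and the sketch does not model how Spark treats nulls or ties.

```scala
object ExtremesSketch {
  // Hypothetical row type and sample data, invented for this illustration.
  final case class Employee(name: String, salary: Double)

  val employees: Seq[Employee] = Seq(
    Employee("Ana", 72000.0),
    Employee("Ben", 95000.0),
    Employee("Cleo", 58000.0)
  )

  // Analogue of max(salary): "what is the highest salary?"
  def highestSalary(rows: Seq[Employee]): Double = rows.map(_.salary).max

  // Analogue of max_by(name, salary): "which employee earns the highest salary?"
  def topEarner(rows: Seq[Employee]): String = rows.maxBy(_.salary).name

  // Analogue of min_by(name, salary): the name on the row with the lowest salary.
  def lowestEarner(rows: Seq[Employee]): String = rows.minBy(_.salary).name
}
```

The first helper returns a value from the salary column, while the other two return a value from a different column on the row where salary is extreme. That is the distinction between max and max_by.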
-
avg and mean in Spark Scala: Compute Column Averages in a DataFrame
avg and mean are aggregate functions that compute the arithmetic mean of values in a numeric column. They're the workhorse functions for summarizing data — average salary by department, average order size by region, moving average of a stock price. The two functions are identical; mean is just an alias for avg.
-
sum in Spark Scala: Aggregate Column Totals in a DataFrame
sum is an aggregate function that totals the values in a numeric column. It's one of the most common operations in Spark Scala — used to roll up sales totals, computed metrics, running balances, and just about any group-level numeric summary.
-
count and countDistinct in Spark Scala: Aggregate Row and Distinct Value Counts in a DataFrame
count, countDistinct, and count_if are aggregate functions for counting rows in a Spark Scala DataFrame. count counts rows or non-null values, countDistinct counts unique values, and count_if counts rows that match a condition.
-
Shift Functions in Spark Scala: shiftleft, shiftright, and shiftrightunsigned in a DataFrame
The bitwise shift functions move the bits of an integer column left or right by a fixed number of positions. They're useful for packing and unpacking flags, fast multiplication or division by powers of two, and working with binary protocol data.
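As a rough illustration of the underlying bit arithmetic, the sketch below uses Scala's built-in shift operators on plain Int values. It assumes the Spark functions follow the same JVM shift semantics, and the helper names are made up for this example.

```scala
object ShiftSketch {
  // Left shift by n positions multiplies by 2^n (ignoring overflow).
  def shiftLeft(x: Int, n: Int): Int = x << n

  // Signed right shift copies the sign bit, so negative values stay negative.
  def shiftRight(x: Int, n: Int): Int = x >> n

  // Unsigned right shift fills the vacated high bits with zeros.
  def shiftRightUnsigned(x: Int, n: Int): Int = x >>> n

  // Unpacking a flag: move the wanted bit to position 0 and mask it.
  def hasFlag(packed: Int, bit: Int): Boolean = ((packed >> bit) & 1) == 1
}
```

For a negative input the two right shifts disagree, which is why the unsigned variant exists: the signed shift preserves the sign, while the unsigned shift treats the value as a raw bit pattern.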
-
e and pi in Spark Scala: Euler's Number and Pi Constants in a DataFrame
e() returns Euler's number (≈ 2.71828) and pi() returns π (≈ 3.14159). They're handy when a Spark expression needs one of these mathematical constants — for circle math, exponential growth, trigonometry, and so on — without you having to hard-code the literal value.
-
width_bucket in Spark Scala: Equiwidth Histogram Buckets in a DataFrame
width_bucket assigns a numeric value to an equiwidth histogram bucket given a range and a bucket count. It's the right tool when you need to bin continuous values into fixed-size groups — age brackets, price tiers, score ranges — without writing a chain of when expressions.
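The bucketing rule can be sketched in plain Scala as follows. This is an illustrative reimplementation under the assumption of an ascending range (min < max) and a positive bucket count, not the Spark function itself; values below the range map to bucket 0 and values at or above the upper bound map to bucket numBuckets + 1.

```scala
object WidthBucketSketch {
  // Illustrative helper: assigns value to one of numBuckets equal-width
  // buckets spanning [min, max). Buckets are numbered from 1.
  def widthBucket(value: Double, min: Double, max: Double, numBuckets: Int): Int = {
    require(numBuckets > 0 && min < max, "sketch assumes min < max and numBuckets > 0")
    if (value < min) 0                     // underflow bucket
    else if (value >= max) numBuckets + 1  // overflow bucket
    else ((value - min) / (max - min) * numBuckets).toInt + 1
  }
}
```

With a range of 0 to 100 and 5 buckets, each bucket is 20 wide, so an age of 37 lands in bucket 2. A chain of when expressions would need one branch per bucket to do the same job.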
-
pmod in Spark Scala: Positive Modulo for DataFrame Columns
The pmod function returns the positive remainder of dividing one column by another. Unlike the standard % operator, which mirrors the sign of the dividend, pmod keeps the result non-negative whenever the divisor is positive. That makes it the right tool for hash bucketing and any cyclic indexing where a negative remainder would point you at the wrong bucket.
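The sign behaviour is easy to see in plain Scala. The sketch below is a hand-written equivalent for Int values, assuming a positive divisor; the helper names are invented for illustration and this is not the Spark call itself.

```scala
object PmodSketch {
  // The standard % operator takes the sign of the dividend.
  def remainder(a: Int, n: Int): Int = a % n

  // Positive modulo: shift a negative remainder back into the range [0, n).
  def pmod(a: Int, n: Int): Int = {
    val r = a % n
    if (r < 0) (r + n) % n else r
  }

  // Hash bucketing: a negative hash still maps to a valid bucket index.
  def bucketFor(hash: Int, numBuckets: Int): Int = pmod(hash, numBuckets)
}
```

Here remainder(-7, 3) is -1, which is not a usable bucket index, while pmod(-7, 3) is 2.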
-
hypot in Spark Scala: Compute the Hypotenuse of Two DataFrame Columns
The hypot function computes sqrt(a² + b²) for two numeric inputs without the intermediate overflow or underflow that a naive implementation would produce. It's the standard tool for distances between points, vector magnitudes, and anywhere the Pythagorean theorem applies.
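A short plain-Scala comparison shows the overflow and underflow problem. It uses math.hypot from the Scala standard library on ordinary Double values as a stand-in for the column function, and the naive helper is written only for contrast.

```scala
object HypotSketch {
  // Naive version: squaring can overflow to Infinity or underflow to 0.0
  // even when the true result is representable.
  def naive(a: Double, b: Double): Double = math.sqrt(a * a + b * b)

  // Library version: avoids the intermediate overflow and underflow.
  def safe(a: Double, b: Double): Double = math.hypot(a, b)
}
```

For inputs of 3e200 and 4e200 the naive version returns Infinity because 9e400 does not fit in a Double, while math.hypot returns a value close to 5e200. The same happens at the small end, where the naive squares collapse to zero.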