Helping You Learn Spark Scala.
Find code samples, tutorials, and the latest news at Sparking Scala. We make it easy to solve your data ETL problems and help you go from code to valuable outcomes quickly.
Recent Spark Scala Examples
-
hypot in Spark Scala: Compute the Hypotenuse of Two DataFrame Columns
The hypot function computes sqrt(a² + b²) for two numeric inputs without the intermediate overflow or underflow that a naive implementation would produce. It's the standard tool for distances between points, vector magnitudes, and anywhere the Pythagorean theorem applies.
-
signum and sign in Spark Scala: Get the Sign of a Numeric DataFrame Column
signum returns -1.0 for negative numbers, 0.0 for zero, and 1.0 for positive numbers. It's useful when you care about the direction of a value but not its magnitude: flagging gains vs. losses, classifying deltas, or branching on the sign of a difference. Spark also exposes sign as a SQL-only alias for signum, along with the related SQL functions negative and positive for working with the sign of a column.
-
factorial in Spark Scala: Compute the Factorial of an Integer Column in a DataFrame
The factorial function returns the factorial of an integer column — the product of all positive integers up to and including the input value (n! = n × (n-1) × ... × 2 × 1). It's useful anywhere you need to compute permutations, combinations, or other counting expressions inline in a DataFrame.
-
degrees and radians in Spark Scala: Convert Between Angle Units on DataFrame Columns
The degrees and radians functions convert DataFrame columns between the two ways of measuring angles. radians turns degrees into radians; degrees does the reverse. They're the unit-conversion helpers you reach for whenever your data is in degrees but you need to feed it into Spark's trig functions, which all expect radians.
-
Trigonometric Functions in Spark Scala: sin, cos, tan and More on DataFrame Columns
Spark Scala exposes the full set of trigonometric functions from java.lang.Math as DataFrame column functions: the basics (sin, cos, tan), their inverses (asin, acos, atan, atan2), the reciprocals (cot, csc, sec), and the hyperbolic versions of all of them. Every input and output is in radians, not degrees — use the radians function to convert if your data is in degrees.
-
exp and expm1 in Spark Scala: Exponential Functions on DataFrame Columns
Spark provides two exponential functions: exp computes e^x and expm1 computes e^x - 1. They're the inverses of log and log1p respectively, and you'll reach for them whenever you need to undo a log transform, compute compound growth, or work with continuous decay.
-
log, log2, log10, log1p, and ln in Spark Scala: Logarithms in a DataFrame
Spark provides a family of logarithm functions: log for natural log or an arbitrary base, log2 and log10 for the two most common bases, log1p for accurate results near zero, and ln (SQL-only) as an alias for the natural log. They all return Double and treat non-positive inputs as null rather than raising errors.
-
sqrt, cbrt, and pow in Spark Scala: Square Roots, Cube Roots, and Powers in a DataFrame
The sqrt, cbrt, and pow functions compute square roots, cube roots, and arbitrary powers of numeric columns. They return Double regardless of input type and behave like Java's Math.sqrt, Math.cbrt, and Math.pow — including how they handle negative inputs and special values like NaN.
-
ceil, floor, and rint in Spark Scala: Rounding to Integers in a DataFrame
The ceil, floor, and rint functions round a numeric column to an integer. ceil rounds up toward positive infinity, floor rounds down toward negative infinity, and rint rounds to the nearest integer using banker's rounding for exact halves. ceil and floor also accept a scale argument to round to a specific number of decimal places.
-
round and bround in Spark Scala: Rounding Numeric Columns in a DataFrame
The round and bround functions round numeric columns to a given number of decimal places. They differ in how they handle exact halves: round rounds half away from zero (the most common convention), while bround uses banker's rounding, which rounds half to the nearest even number to reduce bias in large aggregations.
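The two tie-breaking rules in the round and bround entry can be illustrated in plain Scala, without a Spark session. This is a minimal sketch, not Spark's API: the object and the helper names halfUp and halfEven are hypothetical, and they only mirror the "half away from zero" and "half to even" conventions described above.

```scala
import scala.math.BigDecimal.RoundingMode

// Hypothetical helpers that mirror the two tie-breaking rules.
// They are illustrations only and do not call Spark.
object RoundingSketch {
  // Exact halves go away from zero (the convention described for round).
  def halfUp(x: Double, scale: Int = 0): Double =
    BigDecimal(x).setScale(scale, RoundingMode.HALF_UP).toDouble

  // Exact halves go to the nearest even digit (the convention described for bround).
  def halfEven(x: Double, scale: Int = 0): Double =
    BigDecimal(x).setScale(scale, RoundingMode.HALF_EVEN).toDouble

  def main(args: Array[String]): Unit = {
    // Print both results side by side for a few exact halves.
    for (x <- Seq(0.5, 1.5, 2.5, -2.5))
      println(s"$x  halfUp=${halfUp(x)}  halfEven=${halfEven(x)}")
  }
}
```

The two helpers only disagree on exact halves such as 0.5 and 2.5; every other input rounds the same way under both rules, which is why the difference tends to show up only in large aggregations.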
Recent Spark Scala Tutorials
-
Why collect() on a Large Spark Scala DataFrame Kills Your Driver
collect() is one of the most innocent-looking ways to crash a Spark job. It returns an Array[Row], which feels harmless — until you remember that the array has to fit in driver memory. This tutorial shows what collect() actually does, why it falls over on real datasets, and which APIs to reach for instead.
-
Why Your Spark Scala Join Doubled Your Row Count
You join two DataFrames on what you thought was a primary key, expect the same number of rows you started with, and instead get more. This isn't a bug in Spark — it's how SQL joins are defined. Every row on the left gets matched to every matching row on the right, and if either side has duplicate keys, the output multiplies.
-
Why === and Not == for Column Equality in Spark Scala
Everyone coming to Spark from Scala (or from Pandas) tries col("x") == "value" at least once. It looks right, it sometimes even compiles, and then nothing matches. Spark uses === for column equality — not == — and the reason traces back to a hard limit in the Scala language itself.
-
Writing a Custom Accumulator to Count Filtered Rows in Spark Scala
When a Spark job filters bad rows out of a dataset, you usually want to know how many — and ideally why. Accumulators are Spark's built-in tool for collecting metrics from worker tasks back to the driver. The built-in LongAccumulator handles a single counter; for per-reason breakdowns you need to write a custom one by extending AccumulatorV2.
-
Flattening Deeply Nested Structs into a Flat DataFrame in Spark Scala
JSON events, Protobuf payloads, and Avro records often arrive in Spark as DataFrames with structs nested several levels deep. Most analytical work is easier on a flat schema. This tutorial walks through three approaches — explicit dotted paths, the .* star expansion, and a recursive flatten function — and finishes with the name-collision trap that catches most people the first time.
-
Comparing Two DataFrames to Find Added, Removed, and Changed Rows in Spark Scala
Diffing two DataFrames — yesterday's snapshot against today's, or the output of a pipeline against an expected baseline — is one of the most useful patterns in Spark Scala. This tutorial walks through three approaches: exceptAll for full-row diffs, anti-joins for keyed adds and removes, and a full_outer join that classifies every row as ADDED, REMOVED, CHANGED, or UNCHANGED.
-
Publishing Spark Scala Libraries from GitHub Actions to a Private Maven Repository
Once your team is sharing internal Spark libraries through a private Nexus, Artifactory, or CodeArtifact, you need a reliable way to publish new versions. Doing it from developer laptops causes version drift and credential sprawl. GitHub Actions can give you tagged, reproducible publishes — but wiring publishTo, credentials, and versioning into a workflow has a few sharp edges.
-
Using spark-submit with Private Maven Dependencies
spark-submit --packages resolves dependencies through Ivy, not sbt. That means none of the credential and resolver setup in your build.sbt carries over — the cluster needs its own configuration. Here's how to wire up --repositories, supply credentials, and decide whether --packages is even the right tool for the job.
-
Configuring sbt for Private Maven Repositories in Spark Scala Projects
Most Spark teams keep shared utility libraries — schema definitions, custom UDFs, internal connectors — in a private Maven repository like Nexus or Artifactory. sbt resolves dependencies from Maven Central by default; pointing it at an internal repo and supplying credentials takes a handful of settings, but the order and the file locations matter.
-
Adding a Row Number Without a Natural Ordering Column in Spark Scala
Spark DataFrames have no built-in row order, so adding a row number isn't as simple as it is in SQL or pandas. When you don't have a column to sort by, you need a strategy to manufacture one. This tutorial walks through the options and their trade-offs.
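The row-multiplication effect described in the join tutorial above can be reproduced with ordinary Scala collections. The sketch below is an assumption-laden illustration rather than Spark code: the Order and Customer case classes, the sample data, and the innerJoin helper are all hypothetical, and the for-comprehension simply applies the SQL inner-join definition of pairing each left row with every matching right row.

```scala
// Hypothetical data model and join helper; plain Scala collections, not Spark.
object JoinFanOutSketch {
  final case class Order(customerId: Int, item: String)
  final case class Customer(customerId: Int, name: String)

  // Inner join on customerId: every left row pairs with every matching right row.
  def innerJoin(orders: Seq[Order], customers: Seq[Customer]): Seq[(Order, Customer)] =
    for {
      o <- orders
      c <- customers
      if o.customerId == c.customerId
    } yield (o, c)

  val orders: Seq[Order] = Seq(Order(1, "book"), Order(2, "pen"))

  // customerId 1 appears twice, so it is not actually a unique key.
  val customers: Seq[Customer] =
    Seq(Customer(1, "Ann"), Customer(1, "Ann (duplicate)"), Customer(2, "Bob"))

  def main(args: Array[String]): Unit = {
    val joined = innerJoin(orders, customers)
    // Two orders go in, three joined rows come out.
    println(s"left rows: ${orders.size}, joined rows: ${joined.size}")
  }
}
```

Checking that the join key is unique on at least one side before joining, for example by comparing a row count with a distinct-key count, is one way to catch this early.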
Latest Spark Scala News
-
Delta Lake UniForm: Write Delta, Read Iceberg
Delta Lake's Universal Format generates Iceberg metadata alongside the Delta transaction log, against the same Parquet files, so Iceberg-native engines can read your Delta tables without conversion or duplication. With Delta 4.0.1 restoring Iceberg compat for Spark 4.0, UniForm is once again a usable option for Scala teams that need cross-engine reads — provided you understand the limitations.
-
Apache Polaris: The Open Standard Iceberg Catalog
Apache Polaris graduated to a top-level ASF project in February 2026 and is consolidating as the default open implementation of the Iceberg REST Catalog spec. For Spark Scala teams, it's the piece that lets Spark, Trino, and Flink work against the same Iceberg tables with one source of truth — without Hive Metastore, without per-engine catalog plumbing, and without vendor lock-in.
-
The New Apache Spark Kubernetes Operator: Getting Started
The official Apache Spark Kubernetes Operator launched as an ASF subproject in May 2025, built from scratch instead of forking the aging Kubeflow operator. A year of rapid releases later, it's at 0.9.0 and is the path the Spark community is steering toward for running Scala jobs on Kubernetes.
-
Spark Real-Time Mode vs Apache Flink: Is Spark Finally a True Streaming Engine?
Spark 4.1's Real-Time Mode delivers single-digit millisecond latency for stateless queries and targets sub-300ms p99 on the Databricks runtime, putting Spark within striking distance of Flink for the first time. For most analytics streaming, CDC, and ML feature workloads, RTM closes the gap. For sub-10ms requirements, complex event processing, and true event-at-a-time semantics, Flink still wins — and probably will for a long time.
-
Spark vs Polars: When to Use What in 2026
Polars is the fastest single-node DataFrame engine in open source right now — Rust-backed, multi-threaded, and on small data measurably quicker than DuckDB. It is also not a Spark replacement. For Spark Scala developers the honest framing is: Polars wins for single-node work under ~10GB, Spark wins for everything distributed or stateful, and the JVM story for Polars is thin. Here's a practical decision guide.
-
Apache Iceberg v3: What's New for Spark Users
Iceberg v3 is the first format-version bump since 2021 and finally lands the features Spark Scala teams have been working around for years: deletion vectors that replace position-delete files, mandatory row lineage for cheap CDC, a VARIANT type shared with Spark and Delta, default column values, geospatial types, and nanosecond timestamps. Here's what each feature actually does, which pieces are production-ready on Spark 4.0 today, and what to watch out for if you flip a table to format-version = 3.
-
Apache Gluten: Supercharging Spark with Native C++ Execution
Apache Gluten graduated to an ASF Top-Level Project in March 2026. It pushes Spark's physical operators down to native C++ engines (Velox or ClickHouse) via Substrait and JNI, keeping the JVM in charge of scheduling while the heavy lifting happens off-heap. Here's the architecture, how to wire it up on a Scala workload, and where the rough edges are.
-
Spark Connect for Scala: Building Thin-Client Applications
Spark Connect decouples the application from the cluster with a gRPC protocol, and as of Spark 4.0 the Scala client has near-complete DataFrame and Dataset API parity with classic mode. Here's the architecture, how to wire it up from sbt, and what still doesn't work.
-
The VARIANT Data Type in Spark 4.0: Semi-Structured Data Without Schema Headaches
Spark 4.0 added a native VARIANT type (SPARK-45827) for storing semi-structured data in a compact binary format you can query directly — no upfront schema, no from_json ceremony on every read. The published benchmarks show roughly 8x faster reads than storing the same payload as a JSON string column, and Spark 4.1 adds shredding to push that further. This article shows how the Scala API works, when to reach for VARIANT, and when you still want a strongly typed StructType.
-
ANSI Mode by Default in Spark 4.0: What Breaks and How to Fix It
Spark 4.0 flipped spark.sql.ansi.enabled from false to true, so invalid casts, arithmetic overflow, divide-by-zero, and bad array indices that used to silently return null now throw runtime errors. This guide catalogs each failure mode with the exception you'll see and the try_* function that fixes it without falling back to legacy mode.
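The difference between a silent result, a runtime error, and a "try" variant in the ANSI-mode item above has a close analogue on the plain JVM. The sketch below is only that analogue, written under stated assumptions: the helper names wrappingAdd, strictAdd, and tryAdd are hypothetical, it uses no Spark APIs, and it models a null result as None.

```scala
import scala.util.Try

// Hypothetical helpers; a plain-JVM analogy, not Spark's implementation.
object OverflowSketch {
  // Silent behaviour: Int addition wraps around on overflow without any error.
  def wrappingAdd(a: Int, b: Int): Int = a + b

  // Strict behaviour: Math.addExact throws ArithmeticException on overflow.
  def strictAdd(a: Int, b: Int): Int = Math.addExact(a, b)

  // "Try" behaviour: turn the failure into an absent value instead of an exception.
  def tryAdd(a: Int, b: Int): Option[Int] = Try(Math.addExact(a, b)).toOption

  def main(args: Array[String]): Unit = {
    println(wrappingAdd(Int.MaxValue, 1))            // wraps to Int.MinValue
    println(Try(strictAdd(Int.MaxValue, 1)).isFailure) // the strict version fails
    println(tryAdd(Int.MaxValue, 1))                 // absent value instead of an error
  }
}
```

The design choice mirrors the one the guide describes: keep the strict behaviour as the default so bad data is noticed, and opt in to the forgiving variant only at the call sites where a missing value is an acceptable answer.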