Helping you Learn Spark Scala.

Find code samples, tutorials, and the latest news at Sparking Scala. We make it easy to solve your data ETL problems and help you go from code to valuable outcomes quickly.

Recent Spark Scala Examples

See more Spark Scala examples...

Recent Spark Scala Tutorials

  • Converting a Map Column to Individual Columns with getItem in Spark Scala

    A MapType column is convenient for carrying a bag of key/value attributes through a pipeline, but most downstream work — filtering, joining, aggregating, writing tabular output — is easier when each key has its own column. This tutorial shows how to pull values out of a map with getItem, the apply() shorthand, what happens when a key is missing, and how to expand a map whose keys you don't know until runtime.

  • Reading a JSON File Where the Schema Varies Between Records in Spark Scala

    Event streams, webhook dumps, and log files rarely have a uniform shape — one record carries x and y coordinates, the next has a price, the next an email. Spark's JSON reader handles this gracefully, but the rules for how it handles it are worth knowing before you ship a pipeline. This tutorial walks through schema inference's union behavior, what happens when the same field has different types across records, supplying an explicit schema to stay defensive, capturing malformed rows with _corrupt_record, and the "keep it as raw JSON" escape hatch for genuinely heterogeneous nested payloads.

  • Why collect() on a Large Spark Scala DataFrame Kills Your Driver

    collect() is one of the most innocent-looking ways to crash a Spark job. It returns an Array[Row], which feels harmless — until you remember that the array has to fit in driver memory. This tutorial shows what collect() actually does, why it falls over on real datasets, and which APIs to reach for instead.

  • Why Your Spark Scala Join Doubled Your Row Count

    You join two DataFrames on what you thought was a primary key, expect the same number of rows you started with, and instead get more. This isn't a bug in Spark — it's how SQL joins are defined. Every row on the left gets matched to every matching row on the right, and if either side has duplicate keys, the output multiplies.

  • Why === and Not == for Column Equality in Spark Scala

    Everyone coming to Spark from Scala (or from Pandas) tries col("x") == "value" at least once. It looks right, it sometimes even compiles, and then nothing matches. Spark uses === for column equality — not == — and the reason traces back to a hard limit in the Scala language itself.

  • Writing a Custom Accumulator to Count Filtered Rows in Spark Scala

    When a Spark job filters bad rows out of a dataset, you usually want to know how many — and ideally why. Accumulators are Spark's built-in tool for collecting metrics from worker tasks back to the driver. The built-in LongAccumulator handles a single counter; for per-reason breakdowns you need to write a custom one by extending AccumulatorV2.

  • Flattening Deeply Nested Structs into a Flat DataFrame in Spark Scala

    JSON events, Protobuf payloads, and Avro records often arrive in Spark as DataFrames with structs nested several levels deep. Most analytical work is easier on a flat schema. This tutorial walks through three approaches — explicit dotted paths, the .* star expansion, and a recursive flatten function — and finishes with the name-collision trap that catches most people the first time.

  • Comparing Two DataFrames to Find Added, Removed, and Changed Rows in Spark Scala

    Diffing two DataFrames — yesterday's snapshot against today's, or the output of a pipeline against an expected baseline — is one of the most useful patterns in Spark Scala. This tutorial walks through three approaches: exceptAll for full-row diffs, anti-joins for keyed adds and removes, and a full_outer join that classifies every row as ADDED, REMOVED, CHANGED, or UNCHANGED.

  • Publishing Spark Scala Libraries from GitHub Actions to a Private Maven Repository

    Once your team is sharing internal Spark libraries through a private Nexus, Artifactory, or CodeArtifact, you need a reliable way to publish new versions. Doing it from developer laptops causes version drift and credential sprawl. GitHub Actions can give you tagged, reproducible publishes — but wiring publishTo, credentials, and versioning into a workflow has a few sharp edges.

  • Using spark-submit with Private Maven Dependencies

    spark-submit --packages resolves dependencies through Ivy, not sbt. That means none of the credential and resolver setup in your build.sbt carries over — the cluster needs its own configuration. Here's how to wire up --repositories, supply credentials, and decide whether --packages is even the right tool for the job.

See more Spark Scala tutorials...

Latest Spark Scala News

  • SQL Pipe Syntax in Spark 4.0: Writing More Readable Queries

  • DuckLake: A New Lakehouse Format That Stores Metadata in SQL

    DuckLake is a new open lakehouse format from DuckDB Labs that puts table metadata in a standard SQL database — Postgres, MySQL, SQLite, or DuckDB — instead of writing thousands of small Avro and JSON files like Iceberg and Delta Lake do. v1.0 shipped in April 2026 under the MIT license. For Spark Scala teams the immediate story is not "rip out Iceberg," but the metadata-in-SQL idea is interesting enough to be worth understanding now.

  • MLflow 3.0: What Spark Scala Developers Need to Know

    MLflow 3.0 (released June 2025) rebuilt the platform around LoggedModel as a first-class entity, added GenAI tracing on top of OpenTelemetry, and reorganized how artifacts are stored. Most of the headline features land in the Python and TypeScript SDKs, but the JVM tracking client is still the path Scala teams use to log Spark ML runs from production code — and the changes underneath it are worth knowing before you upgrade.

  • Delta Lake UniForm: Write Delta, Read Iceberg

    Delta Lake's Universal Format generates Iceberg metadata alongside the Delta transaction log, against the same Parquet files, so Iceberg-native engines can read your Delta tables without conversion or duplication. With Delta 4.0.1 restoring Iceberg compat for Spark 4.0, UniForm is once again a usable option for Scala teams that need cross-engine reads — provided you understand the limitations.

  • Apache Polaris: The Open Standard Iceberg Catalog

    Apache Polaris graduated to a top-level ASF project in February 2026 and is consolidating as the default open implementation of the Iceberg REST Catalog spec. For Spark Scala teams, it's the piece that lets Spark, Trino, and Flink work against the same Iceberg tables with one source of truth — without Hive Metastore, without per-engine catalog plumbing, and without vendor lock-in.

  • The New Apache Spark Kubernetes Operator: Getting Started

    The official Apache Spark Kubernetes Operator launched as an ASF subproject in May 2025, built from scratch instead of forking the aging Kubeflow operator. A year of rapid releases later, it's at 0.9.0 and is the path the Spark community is steering toward for running Scala jobs on Kubernetes.

  • Spark Real-Time Mode vs Apache Flink: Is Spark Finally a True Streaming Engine?

    Spark 4.1's Real-Time Mode delivers single-digit millisecond latency for stateless queries and targets sub-300ms p99 on the Databricks runtime, putting Spark within striking distance of Flink for the first time. For most analytics streaming, CDC, and ML feature workloads, RTM closes the gap. For sub-10ms requirements, complex event processing, and true event-at-a-time semantics, Flink still wins — and probably will for a long time.

  • Spark vs Polars: When to Use What in 2026

    Polars is the fastest single-node DataFrame engine in open source right now — Rust-backed, multi-threaded, and on small data measurably quicker than DuckDB. It is also not a Spark replacement. For Spark Scala developers the honest framing is: Polars wins for single-node work under ~10GB, Spark wins for everything distributed or stateful, and the JVM story for Polars is thin. Here's a practical decision guide.

  • Apache Iceberg v3: What's New for Spark Users

    Iceberg v3 is the first format-version bump since 2021 and finally lands the features Spark Scala teams have been working around for years: deletion vectors that replace position-delete files, mandatory row lineage for cheap CDC, a VARIANT type shared with Spark and Delta, default column values, geospatial types, and nanosecond timestamps. Here's what each feature actually does, which pieces are production-ready on Spark 4.0 today, and what to watch out for if you flip a table to format-version = 3.

  • Apache Gluten: Supercharging Spark with Native C++ Execution

    Apache Gluten graduated to an ASF Top-Level Project in March 2026. It pushes Spark's physical operators down to native C++ engines (Velox or ClickHouse) via Substrait and JNI, keeping the JVM in charge of scheduling while the heavy lifting happens off-heap. Here's the architecture, how to wire it up on a Scala workload, and where the rough edges are.

See the latest big data news...