
Helping you Learn Spark Scala.

Find code samples, tutorials, and the latest news at Sparking Scala. We make it easy to solve your data ETL problems and help you go from code to valuable outcomes quickly.


Recent Spark Scala Examples

See more Spark Scala examples...

Recent Spark Scala Tutorials

  • Publishing Spark Scala Libraries from GitHub Actions to a Private Maven Repository

    Once your team is sharing internal Spark libraries through a private Nexus, Artifactory, or CodeArtifact, you need a reliable way to publish new versions. Doing it from developer laptops causes version drift and credential sprawl. GitHub Actions can give you tagged, reproducible publishes — but wiring publishTo, credentials, and versioning into a workflow has a few sharp edges. (A minimal publishTo sketch follows this list.)

  • Using spark-submit with Private Maven Dependencies

    spark-submit --packages resolves dependencies through Ivy, not sbt. That means none of the credential and resolver setup in your build.sbt carries over — the cluster needs its own configuration. Here's how to wire up --repositories, supply credentials, and decide whether --packages is even the right tool for the job.

  • Configuring sbt for Private Maven Repositories in Spark Scala Projects

    Most Spark teams keep shared utility libraries — schema definitions, custom UDFs, internal connectors — in a private Maven repository like Nexus or Artifactory. sbt resolves dependencies from Maven Central by default; pointing it at an internal repo and supplying credentials takes a handful of settings, but the order and the file locations matter. (A resolver-and-credentials sketch follows this list.)

  • Adding a Row Number Without a Natural Ordering Column in Spark Scala

    Spark DataFrames have no built-in row order, so adding a row number isn't as simple as it is in SQL or pandas. When you don't have a column to sort by, you need a strategy to manufacture one. This tutorial walks through the options and their trade-offs. (Two common approaches are sketched after this list.)

  • Unpivoting Columns to Rows with stack in Spark Scala

    Wide DataFrames — where each measure lives in its own column — are common in source data but awkward to aggregate, chart, or join. The stack generator function lets you unpivot those columns into rows without leaving the DataFrame API. (See the sketch after this list.)

  • Using provided Scope for Spark Dependencies in sbt

    Spark dependencies belong in provided scope. The cluster already has them — bundling them into your fat jar wastes space, causes version conflicts, and can break your application at runtime. Here's how provided works in sbt and what to watch out for. (The build.sbt shape is sketched below the list.)

  • Configuring sbt Assembly Merge Strategies for Spark Scala

    Building a fat jar with sbt-assembly for a Spark project almost always hits duplicate file errors. Spark pulls in hundreds of transitive dependencies, and many of them bundle overlapping META-INF files, service descriptors, and even classes. Merge strategies tell sbt-assembly how to resolve these conflicts. (A typical starting point is sketched after this list.)

  • Using when Chains vs Map Lookups for Value Mapping in Spark Scala

    Value mapping — translating one set of codes into another — comes up constantly in data pipelines. Spark gives you two clean ways to do it: chained when/otherwise expressions and Map lookups with typedLit. Each has strengths, and picking the right one depends on what you're mapping. (A side-by-side sketch follows this list.)

  • Setting JVM Options in sbt to Avoid Spark Test OOMs

    Spark spins up a full local mini-cluster inside your test JVM — drivers, executors, shuffle infrastructure — which needs far more heap than sbt's default. Two settings in build.sbt fix this: forking the test JVM and sizing it properly. (Shown in a short sketch after this list.)

  • Why null =!= null Returns null in Spark Scala, Not true

    If you've ever tried to check for missing values using =!= or === in Spark and gotten surprising results, you've hit SQL three-valued logic. In Spark, null doesn't mean false — it means unknown — and that changes how comparisons behave in ways that can silently corrupt your data. (A small demonstration follows this list.)
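
For the GitHub Actions publishing tutorial above, a minimal sketch of the publishTo side. The repository URLs and the RELEASE_VERSION variable are placeholders, assuming the workflow exports the tag it is building; plugins such as sbt-dynver can derive the version instead, and credentials are wired the same way as in the resolver sketch further down.

```scala
// build.sbt -- publishing side only; URLs and the env var name are placeholders
ThisBuild / version := sys.env.getOrElse("RELEASE_VERSION", "0.1.0-SNAPSHOT")

publishTo := {
  val nexus = "https://nexus.example.com/repository/"
  if (version.value.endsWith("-SNAPSHOT"))
    Some("snapshots" at (nexus + "maven-snapshots/"))
  else
    Some("releases" at (nexus + "maven-releases/"))
}
```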
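
For the private-repository sbt tutorial, a sketch of the consuming side. The host name and realm string are placeholders; the realm has to match what your Nexus or Artifactory instance actually reports.

```scala
// build.sbt -- consuming side; host name and realm string are placeholders
resolvers += "internal-releases" at "https://nexus.example.com/repository/maven-releases/"

credentials += Credentials(
  "Sonatype Nexus Repository Manager",  // realm: must match what the server reports
  "nexus.example.com",                  // host only, no protocol or path
  sys.env.getOrElse("NEXUS_USER", ""),
  sys.env.getOrElse("NEXUS_PASS", "")
)
```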
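
For the row-number tutorial, a sketch of two common strategies (the tutorial may cover more). monotonically_increasing_id gives unique but non-contiguous IDs without a shuffle; feeding them into row_number over an unpartitioned window yields contiguous numbers at the cost of pulling every row through a single partition.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("row-number").getOrCreate()
import spark.implicits._

val df = Seq("a", "b", "c").toDF("value")

// unique but non-contiguous IDs, no shuffle required
val withId = df.withColumn("id", monotonically_increasing_id())

// contiguous 1..n, but the unpartitioned window funnels all rows through one partition
val numbered = withId.withColumn("row_num", row_number().over(Window.orderBy($"id")))

numbered.show()
```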
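
For the stack tutorial, a minimal unpivot; the store and month columns are invented for the example.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("unpivot").getOrCreate()
import spark.implicits._

val wide = Seq(("store_1", 10, 20, 30)).toDF("store", "jan", "feb", "mar")

// stack(3, label, col, ...) emits three rows per input row: one per month column
val long = wide.selectExpr(
  "store",
  "stack(3, 'jan', jan, 'feb', feb, 'mar', mar) AS (month, sales)"
)

long.show()
```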
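
For the provided-scope tutorial, the basic build.sbt shape; the Spark version is only an example. One caveat worth knowing: with Spark marked Provided, the test classpath still sees it, but plain `sbt run` does not.

```scala
// build.sbt -- the cluster supplies Spark, so keep it out of the fat jar
val sparkVersion = "3.5.1" // example version

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % Provided,
  "org.apache.spark" %% "spark-sql"  % sparkVersion % Provided
)
```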
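
For the merge-strategy tutorial, a common starting point rather than a universal answer; which cases you need depends on your dependency tree.

```scala
// build.sbt -- resolve duplicate files when sbt-assembly builds the fat jar
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", _*) => MergeStrategy.concat  // keep every service registration
  case PathList("META-INF", _*)             => MergeStrategy.discard // drop manifests and signatures
  case "reference.conf"                     => MergeStrategy.concat  // Typesafe config must be merged
  case _                                    => MergeStrategy.first
}
```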
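
For the value-mapping tutorial, both variants side by side on invented codes. element_at returns null for keys missing from the Map, hence the coalesce fallback.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.master("local[*]").appName("value-mapping").getOrCreate()
import spark.implicits._

val df = Seq("US", "DE", "??").toDF("code")

// variant 1: chained when/otherwise
val viaWhen = df.withColumn("country",
  when($"code" === "US", "United States")
    .when($"code" === "DE", "Germany")
    .otherwise("Unknown"))

// variant 2: a literal Map column looked up per row
val mapping = typedLit(Map("US" -> "United States", "DE" -> "Germany"))
val viaMap = df.withColumn("country", coalesce(element_at(mapping, $"code"), lit("Unknown")))

viaWhen.show()
viaMap.show()
```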
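
For the test-OOM tutorial, the two settings the description names, plus one optional companion. The 4g heap is a guess to adjust for your suite.

```scala
// build.sbt -- run tests in their own JVM with enough heap for a local Spark cluster
Test / fork := true
Test / javaOptions ++= Seq("-Xmx4g", "-XX:+UseG1GC")

// optional: local-mode Spark suites usually behave better run serially
Test / parallelExecution := false
```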
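
For the three-valued-logic tutorial, a small demonstration of how null comparisons fall out of filters, plus the two usual fixes: explicit isNull handling and the null-safe <=> operator.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("null-logic").getOrCreate()
import spark.implicits._

val df = Seq(Some(1), Some(2), None).toDF("x")

// null =!= 1 evaluates to null (unknown), and filters drop unknown rows,
// so this keeps only the row with 2; the null row silently disappears
df.filter($"x" =!= 1).show()

// fix 1: handle null explicitly
df.filter($"x".isNull || $"x" =!= 1).show()

// fix 2: null-safe equality never returns unknown
df.filter(!($"x" <=> 1)).show()
```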

See more Spark Scala tutorials...

Latest Spark Scala News

  • ANSI Mode by Default in Spark 4.0: What Breaks and How to Fix It

    Spark 4.0 flipped spark.sql.ansi.enabled from false to true, so invalid casts, arithmetic overflow, divide-by-zero, and bad array indices that used to silently return null now throw runtime errors. This guide catalogs each failure mode with the exception you'll see and the try_* function that fixes it without falling back to legacy mode. (A short try_* sketch follows this list.)

  • DuckDB for Spark Scala Developers: What You Need to Know

    DuckDB is an embedded, in-process columnar OLAP engine that runs inside your application — no cluster, no JVM serialization tax, sub-second startup. For Spark Scala developers, the entry point is the DuckDB JDBC driver on Maven Central, and the Scala 3 duck4s wrapper if you want a more idiomatic API. This is not a Spark replacement — it's a complementary tool that fits the gaps Spark was never designed to fill. (A minimal JDBC sketch follows this list.)

  • The JVM Is Not Dead: Why Scala Spark Still Makes Sense

    The "PySpark won, Scala is legacy" narrative is half right and half lazy. PySpark genuinely owns notebooks, ML, and the hiring funnel — but Spark itself still runs on the JVM, and Scala code still executes on the engine without a serialization boundary. Here's an honest look at where each language wins in 2026, and why Scala remains the right call for a meaningful slice of production work.

  • Apache Spark on Databricks vs Open Source in 2026

    The Databricks vs open-source Spark debate is usually framed as a feature comparison, but for Scala teams shipping production pipelines it's really a question about who owns operational complexity. Here's a practical decision guide for 2026 — where the gap has narrowed, where it persists, and what should actually drive the call.

  • Dependency Confusion Attacks and Your Private Spark Libraries

    Five years after Alex Birsan's original dependency confusion research collected more than $130,000 in bug bounties from Apple, Microsoft, PayPal, and Shopify, the same class of supply-chain attack is still landing in JVM builds. Spark Scala teams are an especially easy target — and sbt's default resolver behavior, combined with repository managers that auto-proxy Maven Central, makes the attack surprisingly close to a one-line publish on Sonatype's side.

  • Choosing a Private Maven Repository for Your Spark Scala Team in 2026

    Most comparison guides for artifact repositories are written for Java teams using Maven or Gradle. If your team builds Spark Scala applications with sbt, the landscape looks different — and some popular options have sharp edges that only show up with sbt's dependency resolution.

  • sbt 2.0 and What It Means for Spark Scala Projects

    sbt 2.0 is in its final release candidates with the 2.0.0 milestone fully closed. Build definitions now require Scala 3, all tasks are cached by default with Bazel-compatible remote caching, and the plugin ecosystem is being rebuilt. Here's what Spark Scala teams need to know before upgrading.

  • Apache Iceberg vs Delta Lake: Choosing a Table Format

    Both Iceberg and Delta Lake give you ACID transactions, time travel, and schema evolution on top of object storage. But they make different architectural trade-offs that matter when you're building Spark Scala pipelines. Here's a practical comparison to help you choose.

  • Scala 3 and Spark: Where Things Stand in 2026

    Apache Spark still ships exclusively for Scala 2.13, and official Scala 3 support has no target release. But a practical workaround exists today using Scala 3's forward-compatibility mode. Here's what works, what doesn't, and whether your team should try it. (A build.sbt sketch follows this list.)

  • Spark Declarative Pipelines: First Look from a Scala Dev

    Spark 4.1 introduces Spark Declarative Pipelines (SDP) — a framework for building managed ETL pipelines where you declare datasets and Spark handles the rest. The catch for Scala developers: authoring is Python and SQL only, with no JVM support yet.
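
To make the ANSI-mode item concrete, a sketch of two of the try_* escape hatches (try_cast and try_divide have been in Spark SQL since the 3.x releases). Under ANSI mode they return null where the plain operators throw; the sample values are invented.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.master("local[*]").appName("ansi-try").getOrCreate()
import spark.implicits._

val df = Seq(("abc", 1, 0)).toDF("s", "num", "den")

// with spark.sql.ansi.enabled=true the plain CAST and division throw;
// the try_* forms return null for the bad rows instead
df.selectExpr(
  "try_cast(s AS INT)   AS s_as_int",
  "try_divide(num, den) AS ratio"
).show()
```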
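
To make the DuckDB item concrete, a minimal sketch against the JDBC driver it mentions (org.duckdb:duckdb_jdbc on Maven Central). The Parquet path is hypothetical; DuckDB can query Parquet files directly without loading them into a table first.

```scala
import java.sql.DriverManager

// "jdbc:duckdb:" opens an in-memory database inside this JVM; add a file path to persist
val conn = DriverManager.getConnection("jdbc:duckdb:")
try {
  val stmt = conn.createStatement()
  val rs = stmt.executeQuery(
    "SELECT count(*) FROM read_parquet('data/events.parquet')" // hypothetical path
  )
  while (rs.next()) println(rs.getLong(1))
} finally {
  conn.close()
}
```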
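
To make the Scala 3 item concrete, the usual shape of the forward-compatibility wiring in build.sbt: CrossVersion.for3Use2_13 compiles your Scala 3 code against the Scala 2.13 Spark artifacts, which is presumably the workaround the blurb refers to. Versions are examples.

```scala
// build.sbt -- Scala 3 project consuming the Scala 2.13 Spark artifacts
scalaVersion := "3.3.4" // example version

libraryDependencies ++= Seq(
  ("org.apache.spark" %% "spark-sql" % "3.5.1" % Provided)
    .cross(CrossVersion.for3Use2_13)
)
```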

See the latest big data news...