The Guide You Need
Spark Scala Tutorials
Straightforward Spark Scala tutorials. Learn best practices, Databricks platform nuances, and the latest in big data trends...
-
Writing a Custom Accumulator to Count Filtered Rows in Spark Scala
When a Spark job filters bad rows out of a dataset, you usually want to know how many — and ideally why. Accumulators are Spark's built-in tool for collecting metrics from worker tasks back to the driver. The built-in LongAccumulator handles a single counter; for per-reason breakdowns you need to write a custom one by extending AccumulatorV2.
-
Flattening Deeply Nested Structs into a Flat DataFrame in Spark Scala
JSON events, Protobuf payloads, and Avro records often arrive in Spark as DataFrames with structs nested several levels deep. Most analytical work is easier on a flat schema. This tutorial walks through three approaches — explicit dotted paths, the .* star expansion, and a recursive flatten function — and finishes with the name-collision trap that catches most people the first time.
-
Comparing Two DataFrames to Find Added, Removed, and Changed Rows in Spark Scala
Diffing two DataFrames — yesterday's snapshot against today's, or the output of a pipeline against an expected baseline — is one of the most useful patterns in Spark Scala. This tutorial walks through three approaches: exceptAll for full-row diffs, anti-joins for keyed adds and removes, and a full_outer join that classifies every row as ADDED, REMOVED, CHANGED, or UNCHANGED.
-
Publishing Spark Scala Libraries from GitHub Actions to a Private Maven Repository
Once your team is sharing internal Spark libraries through a private Nexus, Artifactory, or CodeArtifact, you need a reliable way to publish new versions. Doing it from developer laptops causes version drift and credential sprawl. GitHub Actions can give you tagged, reproducible publishes — but wiring publishTo, credentials, and versioning into a workflow has a few sharp edges.
-
Using spark-submit with Private Maven Dependencies
spark-submit --packages resolves dependencies through Ivy, not sbt. That means none of the credential and resolver setup in your build.sbt carries over — the cluster needs its own configuration. Here's how to wire up --repositories, supply credentials, and decide whether --packages is even the right tool for the job.
-
Configuring sbt for Private Maven Repositories in Spark Scala Projects
Most Spark teams keep shared utility libraries — schema definitions, custom UDFs, internal connectors — in a private Maven repository like Nexus or Artifactory. sbt resolves dependencies from Maven Central by default; pointing it at an internal repo and supplying credentials takes a handful of settings, but the order and the file locations matter.
-
Adding a Row Number Without a Natural Ordering Column in Spark Scala
Spark DataFrames have no built-in row order, so adding a row number isn't as simple as it is in SQL or pandas. When you don't have a column to sort by, you need a strategy to manufacture one. This tutorial walks through the options and their trade-offs.
-
Unpivoting Columns to Rows with stack in Spark Scala
Wide DataFrames — where each measure lives in its own column — are common in source data but awkward to aggregate, chart, or join. The stack generator function lets you unpivot those columns into rows without leaving the DataFrame API.
-
Using provided Scope for Spark Dependencies in sbt
Spark dependencies belong in provided scope. The cluster already has them — bundling them into your fat jar wastes space, causes version conflicts, and can break your application at runtime. Here's how provided works in sbt and what to watch out for.
-
Configuring sbt Assembly Merge Strategies for Spark Scala
Building a fat jar with sbt-assembly for a Spark project almost always hits duplicate file errors. Spark pulls in hundreds of transitive dependencies, and many of them bundle overlapping META-INF files, service descriptors, and even classes. Merge strategies tell sbt-assembly how to resolve these conflicts.