The Guide You Need
Spark Scala Tutorials
Straightforward Spark Scala tutorials. Learn best practices, Databricks platform nuances and the latest in big data trends...
-
Converting a Map Column to Individual Columns with getItem in Spark Scala
A MapType column is convenient for carrying a bag of key/value attributes through a pipeline, but most downstream work (filtering, joining, aggregating, writing tabular output) is easier when each key has its own column. This tutorial shows how to pull values out of a map with getItem or the apply() shorthand, what happens when a key is missing, and how to expand a map whose keys you don't know until runtime.
-
Reading a JSON File Where the Schema Varies Between Records in Spark Scala
Event streams, webhook dumps, and log files rarely have a uniform shape: one record carries x and y coordinates, the next has a price, the next an email. Spark's JSON reader handles this gracefully, but the rules it follows are worth knowing before you ship a pipeline. This tutorial walks through schema inference's union behavior, what happens when the same field has different types across records, supplying an explicit schema to stay defensive, capturing malformed rows with _corrupt_record, and the "keep it as raw JSON" escape hatch for genuinely heterogeneous nested payloads.
-
Why collect() on a Large Spark Scala DataFrame Kills Your Driver
collect() is one of the most innocent-looking ways to crash a Spark job. It returns an Array[Row], which feels harmless — until you remember that the array has to fit in driver memory. This tutorial shows what collect() actually does, why it falls over on real datasets, and which APIs to reach for instead.
-
Why Your Spark Scala Join Doubled Your Row Count
You join two DataFrames on what you thought was a primary key, expect the same number of rows you started with, and instead get more. This isn't a bug in Spark — it's how SQL joins are defined. Every row on the left gets matched to every matching row on the right, and if either side has duplicate keys, the output multiplies.
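The multiplication is easy to reproduce without a cluster. The sketch below mimics inner-join semantics with plain Scala collections rather than Spark's join API; the `Customer` and `Order` case classes and the sample rows are hypothetical and exist only for illustration.

```scala
// Hypothetical sample types; not part of any Spark API.
final case class Customer(id: Int, name: String)
final case class Order(customerId: Int, amount: Double)

object JoinFanOut {
  // Inner-join semantics: emit one output row per matching (left, right) pair.
  def innerJoin(orders: Seq[Order], customers: Seq[Customer]): Seq[(Order, Customer)] =
    for {
      o <- orders
      c <- customers
      if o.customerId == c.id
    } yield (o, c)

  val orders: Seq[Order] = Seq(Order(1, 10.0), Order(2, 25.0))

  // id 1 appears twice, so id is not actually a primary key.
  val customers: Seq[Customer] = Seq(
    Customer(1, "Ada"),
    Customer(1, "Ada (duplicate)"),
    Customer(2, "Grace")
  )

  // Two orders go in, three rows come out: the order for customer 1
  // is paired with both customer rows that share its key.
  val joined: Seq[(Order, Customer)] = innerJoin(orders, customers)
}
```

In a real job, one way to confirm the assumption before joining is to count rows per key on each side, for example with `groupBy` on the key followed by `count()` and a filter for counts above 1.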
-
Why === and Not == for Column Equality in Spark Scala
Everyone coming to Spark from Scala (or from Pandas) tries col("x") == "value" at least once. It looks right, it sometimes even compiles, and then nothing matches. Spark uses === for column equality — not == — and the reason traces back to a hard limit in the Scala language itself.
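The language limit can be seen in a few lines of plain Scala. In the sketch below, `MiniColumn` is a hypothetical stand-in for Spark's `Column`, not the real class: `==` is declared final on `Any` and always returns a `Boolean`, so a library that wants a comparison to produce an expression object has to use a differently named operator.

```scala
// MiniColumn is a hypothetical stand-in for Spark's Column, for illustration only.
final case class MiniColumn(expr: String) {
  // A custom operator is free to return an expression object instead of a Boolean.
  def ===(other: Any): MiniColumn = MiniColumn(s"($expr = $other)")
}

object EqualityDemo {
  val x: MiniColumn = MiniColumn("x")
  val literal: Any = "value"

  // == cannot be overridden to return anything but Boolean. It is evaluated
  // immediately and compares the column object itself to the string.
  val plain: Boolean = x == literal        // false

  // === only describes the comparison, so it can be evaluated later, per row.
  val deferred: MiniColumn = x === literal // MiniColumn("(x = value)")
}
```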
-
Writing a Custom Accumulator to Count Filtered Rows in Spark Scala
When a Spark job filters bad rows out of a dataset, you usually want to know how many — and ideally why. Accumulators are Spark's built-in tool for collecting metrics from worker tasks back to the driver. The built-in LongAccumulator handles a single counter; for per-reason breakdowns you need to write a custom one by extending AccumulatorV2.
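A per-reason counter might look roughly like the following. This is a minimal sketch, not the tutorial's own implementation: the class name `ReasonCountAccumulator` and the choice of a `String` reason as input are assumptions, while the overridden methods are the ones `AccumulatorV2` requires.

```scala
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Hypothetical accumulator: counts how many rows were dropped for each reason.
class ReasonCountAccumulator extends AccumulatorV2[String, Map[String, Long]] {
  private val counts = mutable.Map.empty[String, Long]

  // True when nothing has been recorded yet.
  override def isZero: Boolean = counts.isEmpty

  // Independent copy carrying the current counts.
  override def copy(): ReasonCountAccumulator = {
    val acc = new ReasonCountAccumulator
    acc.counts ++= counts
    acc
  }

  override def reset(): Unit = counts.clear()

  // Called from tasks: record one filtered row for the given reason.
  override def add(reason: String): Unit =
    counts(reason) = counts.getOrElse(reason, 0L) + 1L

  // Called on the driver: fold another task's partial counts into this one.
  override def merge(other: AccumulatorV2[String, Map[String, Long]]): Unit =
    other.value.foreach { case (reason, n) =>
      counts(reason) = counts.getOrElse(reason, 0L) + n
    }

  // Immutable snapshot of the counts.
  override def value: Map[String, Long] = counts.toMap
}
```

Assuming an existing SparkSession named `spark`, an instance would be registered with `spark.sparkContext.register(acc, "filteredRows")` before use. Updates made inside transformations can be applied more than once if a task is re-executed, so counts collected that way are best treated as approximate.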
-
Flattening Deeply Nested Structs into a Flat DataFrame in Spark Scala
JSON events, Protobuf payloads, and Avro records often arrive in Spark as DataFrames with structs nested several levels deep. Most analytical work is easier on a flat schema. This tutorial walks through three approaches — explicit dotted paths, the .* star expansion, and a recursive flatten function — and finishes with the name-collision trap that catches most people the first time.
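The recursive approach could be sketched as below. The names `leafPaths`, `flatName`, and `flatten` are hypothetical, and the sketch assumes field names contain no dots and only descends into structs, leaving arrays and maps of structs as they are.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

object FlattenSketch {
  // Walk the schema depth-first and return the dotted path of every leaf field.
  def leafPaths(schema: StructType, prefix: String = ""): Seq[String] =
    schema.fields.toSeq.flatMap { field =>
      val path = if (prefix.isEmpty) field.name else s"$prefix.${field.name}"
      field.dataType match {
        case nested: StructType => leafPaths(nested, path)
        case _                  => Seq(path)
      }
    }

  // Turn a dotted path into a flat column name. This can collide:
  // a top-level field "user_id" and a nested "user.id" both map to "user_id".
  def flatName(path: String): String = path.replace('.', '_')

  // Select every leaf by its dotted path and alias it to the flat name.
  def flatten(df: DataFrame): DataFrame =
    df.select(leafPaths(df.schema).map(p => col(p).as(flatName(p))): _*)
}
```

Because `flatName` can map two different paths to the same name, a check that the generated names are distinct before calling `select` is a reasonable guard.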
-
Comparing Two DataFrames to Find Added, Removed, and Changed Rows in Spark Scala
Diffing two DataFrames — yesterday's snapshot against today's, or the output of a pipeline against an expected baseline — is one of the most useful patterns in Spark Scala. This tutorial walks through three approaches: exceptAll for full-row diffs, anti-joins for keyed adds and removes, and a full_outer join that classifies every row as ADDED, REMOVED, CHANGED, or UNCHANGED.
-
Publishing Spark Scala Libraries from GitHub Actions to a Private Maven Repository
Once your team is sharing internal Spark libraries through a private Nexus, Artifactory, or CodeArtifact, you need a reliable way to publish new versions. Doing it from developer laptops causes version drift and credential sprawl. GitHub Actions can give you tagged, reproducible publishes — but wiring publishTo, credentials, and versioning into a workflow has a few sharp edges.
-
Using spark-submit with Private Maven Dependencies
spark-submit --packages resolves dependencies through Ivy, not sbt. That means none of the credential and resolver setup in your build.sbt carries over — the cluster needs its own configuration. Here's how to wire up --repositories, supply credentials, and decide whether --packages is even the right tool for the job.