tutorials Spark Scala Code and Functions

Why .count() on a Filtered Spark Scala DataFrame Triggers a Full Scan

2026-06-10 10:35:08 PM

df.count() feels like it should be free: it's just counting rows. And on a plain table backed by Parquet or Iceberg, it more or less is, because Spark can answer from the row counts stored in file metadata instead of reading the data. The moment you add a .filter() in front of it, that shortcut disappears. Spark has to evaluate the filter column for every row it cannot skip in order to decide which ones to count, and unless the filtered result is cached, every additional .count() you write re-runs the whole pipeline from scratch.
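As a rough illustration of the point above, here is a minimal sketch, assuming a local SparkSession and a hypothetical `events` DataFrame with `id` and `status` columns (none of these names come from the tutorial). It shows two common ways to avoid repeating the scan: caching the filtered DataFrame before counting it more than once, and computing several conditional counts in a single aggregation.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, count, lit, when}

object FilteredCountSketch {
  // One pass over the data yields both counts. when(...) without otherwise
  // produces null for non-matching rows, and count ignores nulls.
  def countsInOnePass(events: DataFrame): (Long, Long) = {
    val row = events.agg(
      count(when(col("status") === "ok", lit(1))).as("ok_count"),
      count(when(col("status") === "error", lit(1))).as("error_count")
    ).head()
    (row.getLong(0), row.getLong(1))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filtered-count-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for a real table.
    val events = Seq((1, "ok"), (2, "error"), (3, "ok")).toDF("id", "status")

    // Without cache(), each count() below would re-evaluate the filter.
    val okEvents = events.filter(col("status") === "ok").cache()
    val firstCount = okEvents.count()  // evaluates the filter and fills the cache
    val secondCount = okEvents.count() // can be served from the cached data
    println(s"ok rows: $firstCount, again: $secondCount")

    println(s"(ok, error) in one pass: ${countsInOnePass(events)}")

    okEvents.unpersist()
    spark.stop()
  }
}
```

Caching trades memory for repeated work, so it only pays off when the filtered result is reused; for a handful of related counts, the single aggregation is usually the lighter option.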

Converting a Map Column to Individual Columns with getItem in Spark Scala

2026-06-06 10:25:11 PM

A MapType column is convenient for carrying a bag of key/value attributes through a pipeline, but most downstream work — filtering, joining, aggregating, writing tabular output — is easier when each key has its own column. This tutorial shows how to pull values out of a map with getItem, the apply() shorthand, what happens when a key is missing, and how to expand a map whose keys you don't know until runtime.
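The techniques named above might look roughly like the following sketch. The DataFrame shape (an `id` column plus a map column called `attrs`) and the keys `color` and `size` are assumptions made up for illustration. `expandKnown` uses `getItem` and the `apply()` shorthand for keys known in advance; `expandDynamic` first collects the distinct keys to the driver, which assumes the key set is small.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode, map_keys}

object MapToColumnsSketch {
  // Keys known up front: getItem and the apply() shorthand are equivalent.
  def expandKnown(df: DataFrame): DataFrame =
    df.select(
      col("id"),
      col("attrs").getItem("color").as("color"),
      col("attrs")("size").as("size")
    )

  // Keys discovered at runtime: gather the distinct keys on the driver,
  // then build one getItem column per key.
  def expandDynamic(df: DataFrame): DataFrame = {
    val keys = df
      .select(explode(map_keys(col("attrs"))).as("k"))
      .distinct()
      .collect()
      .map(_.getString(0))
      .sorted
    val keyColumns = keys.map(k => col("attrs").getItem(k).as(k))
    df.select(col("id") +: keyColumns: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("map-to-columns-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: the second row has no "size" key.
    val df = Seq(
      (1, Map("color" -> "red", "size" -> "M")),
      (2, Map("color" -> "blue"))
    ).toDF("id", "attrs")

    expandKnown(df).show()
    expandDynamic(df).show()
    spark.stop()
  }
}
```

In recent Spark versions a missing key generally comes back as null from `getItem`, but the behavior has depended on version and ANSI settings in the past, so it is worth checking on the version you run.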

Reading a JSON File Where the Schema Varies Between Records in Spark Scala

2026-06-03 10:54:01 PM

Event streams, webhook dumps, and log files rarely have a uniform shape — one record carries x and y coordinates, the next has a price, the next an email. Spark's JSON reader handles this gracefully, but the rules for how it handles it are worth knowing before you ship a pipeline. This tutorial walks through schema inference's union behavior, what happens when the same field has different types across records, supplying an explicit schema to stay defensive, capturing malformed rows with _corrupt_record, and the "keep it as raw JSON" escape hatch for genuinely heterogeneous nested payloads.

Why collect() on a Large Spark Scala DataFrame Kills Your Driver

2026-05-30 11:00:11 PM

collect() is one of the most innocent-looking ways to crash a Spark job. It returns an Array[Row], which feels harmless — until you remember that the array has to fit in driver memory. This tutorial shows what collect() actually does, why it falls over on real datasets, and which APIs to reach for instead.

Why Your Spark Scala Join Doubled Your Row Count

2026-05-27 10:23:50 PM

You join two DataFrames on what you thought was a primary key, expect the same number of rows you started with, and instead get more. This isn't a bug in Spark — it's how SQL joins are defined. Every row on the left gets matched to every matching row on the right, and if either side has duplicate keys, the output multiplies.
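The multiplication is easy to reproduce. The sketch below assumes a local SparkSession and two hypothetical DataFrames, `orders` and `customers`, where `customer_id` is duplicated on the right side; all names are illustrative. It also shows one way to find the offending keys before joining.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object JoinFanOutSketch {
  // Plain inner join on the shared key column.
  def joinOnCustomer(orders: DataFrame, customers: DataFrame): DataFrame =
    orders.join(customers, Seq("customer_id"))

  // Keys that appear more than once and would therefore multiply rows.
  def duplicatedKeys(df: DataFrame, key: String): DataFrame =
    df.groupBy(key).count().filter(col("count") > 1)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("join-fan-out-sketch")
      .getOrCreate()
    import spark.implicits._

    val orders = Seq((1, "o-1"), (2, "o-2")).toDF("customer_id", "order_id")
    // customer_id 1 appears twice, so it is not really a primary key.
    val customers = Seq(
      (1, "a@example.com"),
      (1, "a-old@example.com"),
      (2, "b@example.com")
    ).toDF("customer_id", "email")

    println(s"orders: ${orders.count()}")                            // 2
    println(s"joined: ${joinOnCustomer(orders, customers).count()}") // 3

    duplicatedKeys(customers, "customer_id").show()

    // One possible fix: keep a single row per key before joining.
    // Which duplicate survives dropDuplicates is not deterministic.
    val uniqueCustomers = customers.dropDuplicates("customer_id")
    println(s"joined after dedupe: ${joinOnCustomer(orders, uniqueCustomers).count()}") // 2

    spark.stop()
  }
}
```

Dropping duplicates is only appropriate when any of the duplicate rows is acceptable; otherwise the right side usually needs an explicit rule for choosing the row to keep.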

Why === and Not == for Column Equality in Spark Scala

2026-05-23 10:26:02 PM

Everyone coming to Spark from Scala (or from Pandas) tries col("x") == "value" at least once. It looks right, it sometimes even compiles, and then nothing matches. Spark uses === for column equality — not == — and the reason traces back to a hard limit in the Scala language itself.
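A small sketch makes the difference visible. In Scala, `==` is declared on `Any` as a final method that returns `Boolean`, so it can only compare the `Column` object itself with the right-hand value; `===` is an ordinary method on `Column` that returns a new `Column` expression. The column name `x` and the object name below are illustrative.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col

object ColumnEqualitySketch {
  // Scala's == yields a Boolean right away: it asks whether the Column
  // object equals the String "value", which it never does.
  // The compiler may warn that this comparison is always false.
  val plainEquals: Boolean = col("x") == "value"

  // === builds an expression that Spark evaluates for every row.
  val columnEquals: Column = col("x") === "value"

  // The matching inequality operator is =!=.
  val columnNotEquals: Column = col("x") =!= "value"

  def main(args: Array[String]): Unit = {
    println(s"col == value   -> $plainEquals")
    println(s"col === value  -> $columnEquals")
    println(s"col =!= value  -> $columnNotEquals")
  }
}
```

Because `filter` expects a `Column` (or a SQL string), passing the `Boolean` from `==` usually fails to compile there; the silent mismatches tend to show up in places that accept any value.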

Writing a Custom Accumulator to Count Filtered Rows in Spark Scala

2026-05-20 10:17:49 PM

When a Spark job filters bad rows out of a dataset, you usually want to know how many — and ideally why. Accumulators are Spark's built-in tool for collecting metrics from worker tasks back to the driver. The built-in LongAccumulator handles a single counter; for per-reason breakdowns you need to write a custom one by extending AccumulatorV2.
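One possible shape for such an accumulator is sketched below. The class name `ReasonCountAccumulator` and the choice of a `String` reason as input are assumptions for illustration, not the tutorial's own code; the overridden methods are the abstract members of `AccumulatorV2`.

```scala
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable

// Counts how many rows were rejected for each reason.
// Input type: a reason label. Output type: an immutable reason -> count map.
class ReasonCountAccumulator extends AccumulatorV2[String, Map[String, Long]] {
  private val counts = mutable.Map.empty[String, Long]

  // True when nothing has been recorded yet.
  override def isZero: Boolean = counts.isEmpty

  // Spark copies the accumulator when shipping it to tasks.
  override def copy(): ReasonCountAccumulator = {
    val acc = new ReasonCountAccumulator
    counts.foreach { case (reason, n) => acc.counts.update(reason, n) }
    acc
  }

  override def reset(): Unit = counts.clear()

  // Called inside tasks, once per rejected row.
  override def add(reason: String): Unit = {
    counts.update(reason, counts.getOrElse(reason, 0L) + 1L)
  }

  // Called on the driver to fold a task's partial result into the total.
  override def merge(other: AccumulatorV2[String, Map[String, Long]]): Unit = {
    other.value.foreach { case (reason, n) =>
      counts.update(reason, counts.getOrElse(reason, 0L) + n)
    }
  }

  override def value: Map[String, Long] = counts.toMap
}
```

To use it, the accumulator would be registered on the driver with `spark.sparkContext.register(acc, "rejected_rows")` (the name is arbitrary), `acc.add("some_reason")` called from the filtering logic, and `acc.value` read after an action has run. Updates made inside transformations can be applied more than once if a task is retried, so the counts are best treated as diagnostics rather than exact totals.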

Flattening Deeply Nested Structs into a Flat DataFrame in Spark Scala

2026-05-16 10:26:49 PM

JSON events, Protobuf payloads, and Avro records often arrive in Spark as DataFrames with structs nested several levels deep. Most analytical work is easier on a flat schema. This tutorial walks through three approaches — explicit dotted paths, the .* star expansion, and a recursive flatten function — and finishes with the name-collision trap that catches most people the first time.

Comparing Two DataFrames to Find Added, Removed, and Changed Rows in Spark Scala

2026-05-13 10:39:54 PM

Diffing two DataFrames — yesterday's snapshot against today's, or the output of a pipeline against an expected baseline — is one of the most useful patterns in Spark Scala. This tutorial walks through three approaches: exceptAll for full-row diffs, anti-joins for keyed adds and removes, and a full_outer join that classifies every row as ADDED, REMOVED, CHANGED, or UNCHANGED.
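The full_outer classification could be sketched as follows, assuming both snapshots share a key column `id` and a single compared column `value`; those names, and the marker columns `in_before` and `in_after`, are hypothetical. The null-safe operator `<=>` is used so that a null on both sides counts as unchanged.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object DataFrameDiffSketch {
  // Labels every key as ADDED, REMOVED, CHANGED, or UNCHANGED.
  def classify(before: DataFrame, after: DataFrame): DataFrame = {
    // Marker columns record which side a row came from; after the outer
    // join they are null for the side that had no matching key.
    val b = before.select(col("id"), col("value").as("old_value"), lit(true).as("in_before"))
    val a = after.select(col("id"), col("value").as("new_value"), lit(true).as("in_after"))

    b.join(a, Seq("id"), "full_outer")
      .withColumn(
        "change",
        when(col("in_before").isNull, "ADDED")
          .when(col("in_after").isNull, "REMOVED")
          .when(!(col("old_value") <=> col("new_value")), "CHANGED")
          .otherwise("UNCHANGED")
      )
      .select("id", "old_value", "new_value", "change")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("dataframe-diff-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical snapshots.
    val before = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
    val after = Seq((2, "b"), (3, "x"), (4, "d")).toDF("id", "value")

    classify(before, after).orderBy("id").show()

    // Full-row diff without a key: rows present in `after` but not in `before`.
    after.exceptAll(before).show()

    spark.stop()
  }
}
```

With more than one compared column, the CHANGED test would need one null-safe comparison per column (or a comparison of row hashes); this sketch keeps to a single column to stay readable.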

Publishing Spark Scala Libraries from GitHub Actions to a Private Maven Repository

2026-05-09 10:47:06 PM

Once your team is sharing internal Spark libraries through a private Nexus, Artifactory, or CodeArtifact, you need a reliable way to publish new versions. Doing it from developer laptops causes version drift and credential sprawl. GitHub Actions can give you tagged, reproducible publishes — but wiring publishTo, credentials, and versioning into a workflow has a few sharp edges.
