The Guide You Need
Spark Scala Tutorials
Straightforward Spark Scala tutorials. Learn best practices, Databricks platform nuances, and the latest big data trends...
-
Unpivoting Columns to Rows with stack in Spark Scala
Wide DataFrames — where each measure lives in its own column — are common in source data but awkward to aggregate, chart, or join. The stack generator function lets you unpivot those columns into rows without leaving the DataFrame API.
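A minimal sketch of the idea, using hypothetical quarterly sales data and a local SparkSession: `stack(n, label1, col1, ...)` emits `n` rows per input row, turning each measure column into a (label, value) pair.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("unpivot-demo").getOrCreate()
import spark.implicits._

// Hypothetical wide DataFrame: one column per quarter
val wide = Seq(("storeA", 100, 120, 90)).toDF("store", "q1", "q2", "q3")

// stack(3, ...) turns the three quarter columns into three rows
val long = wide.selectExpr(
  "store",
  "stack(3, 'q1', q1, 'q2', q2, 'q3', q3) as (quarter, sales)"
)

long.show()
```

The long shape makes grouping trivial afterwards, e.g. `long.groupBy("quarter").sum("sales")`.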
-
Using provided Scope for Spark Dependencies in sbt
Spark dependencies belong in provided scope. The cluster already has them — bundling them into your fat jar wastes space, causes version conflicts, and can break your application at runtime. Here's how provided works in sbt and what to watch out for.
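A build.sbt sketch of the pattern; the version numbers are assumptions, so match them to your cluster. One caveat worth noting: `Provided` jars also disappear from `sbt run`, so local runs need them added back to the runtime classpath.

```scala
// build.sbt (sketch) — versions are assumptions; match your cluster
val sparkVersion = "3.5.0"

libraryDependencies ++= Seq(
  // Provided: compiled against, but excluded from the fat jar
  "org.apache.spark" %% "spark-sql" % sparkVersion % Provided
)

// Provided jars vanish from `sbt run` too; restore them for local runs
Compile / run := Defaults.runTask(
  Compile / fullClasspath,
  Compile / run / mainClass,
  Compile / run / runner
).evaluated
```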
-
Configuring sbt Assembly Merge Strategies for Spark Scala
Building a fat jar with sbt-assembly for a Spark project almost always hits duplicate file errors. Spark pulls in hundreds of transitive dependencies, and many of them bundle overlapping META-INF files, service descriptors, and even classes. Merge strategies tell sbt-assembly how to resolve these conflicts.
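A typical starting point, sketched for sbt-assembly 1.x slash syntax: concatenate service descriptors and `reference.conf` (both must survive merged), discard the rest of META-INF, and fall back to the plugin's default strategy for everything else.

```scala
// build.sbt (sketch, assuming sbt-assembly 1.x)
assembly / assemblyMergeStrategy := {
  // ServiceLoader registrations: keep every entry from every jar
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  // Signatures, manifests, licenses: safe to drop
  case PathList("META-INF", _*)             => MergeStrategy.discard
  // Typesafe config fragments must be merged, not overwritten
  case "reference.conf"                     => MergeStrategy.concat
  // Everything else: delegate to the plugin's default behavior
  case other =>
    val defaultStrategy = (assembly / assemblyMergeStrategy).value
    defaultStrategy(other)
}
```

Case order matters: the `META-INF/services` concat must come before the broader META-INF discard.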
-
Using when Chains vs Map Lookups for Value Mapping in Spark Scala
Value mapping — translating one set of codes into another — comes up constantly in data pipelines. Spark gives you two clean ways to do it: chained when/otherwise expressions and Map lookups with typedLit. Each has strengths, and picking the right one depends on what you're mapping.
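Both options side by side, on hypothetical country codes with a local SparkSession:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("mapping-demo").getOrCreate()
import spark.implicits._

val df = Seq("US", "DE", "FR").toDF("code")

// Option 1: chained when/otherwise — readable for a handful of values
val withWhen = df.withColumn("country",
  when($"code" === "US", "United States")
    .when($"code" === "DE", "Germany")
    .otherwise("Unknown"))

// Option 2: a Map literal via typedLit — scales better to large mappings
val lookup = typedLit(Map("US" -> "United States", "DE" -> "Germany"))
val withMap = df.withColumn("country",
  coalesce(element_at(lookup, $"code"), lit("Unknown")))
```

`element_at` returns null for missing keys, so `coalesce` supplies the default that `otherwise` gives you in the when chain.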
-
Setting JVM Options in sbt to Avoid Spark Test OOMs
Spark spins up a full local mini-cluster inside your test JVM — drivers, executors, shuffle infrastructure — which needs far more heap than sbt's default. Two settings in build.sbt fix this: forking the test JVM and sizing it properly.
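The two settings in sketch form; the 4g heap is an assumption, so size it to your test suite.

```scala
// build.sbt (sketch): run tests in their own JVM instead of inside sbt's
Test / fork := true

// Give the forked JVM a heap big enough for a local Spark mini-cluster
// (4g is an assumption — tune for your suite)
Test / javaOptions ++= Seq("-Xmx4g")
```

Without `fork := true`, the `javaOptions` are silently ignored and tests share sbt's own JVM and heap.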
-
Why null =!= null Returns null in Spark Scala, Not true
If you've ever tried to check for missing values using =!= or === in Spark and gotten surprising results, you've hit SQL three-valued logic. In Spark, null doesn't mean false — it means unknown — and that changes how comparisons behave in ways that can silently corrupt your data.
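A small demonstration of the trap, with a hypothetical nullable column: the row where `x` is null fails the `=!=` filter not because the predicate is false, but because it is unknown, and `filter` drops anything that isn't true.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("null-demo").getOrCreate()
import spark.implicits._

val df = Seq(Some(1), None).toDF("x")

// null =!= 2 evaluates to null (unknown), so the null row is silently dropped
val naive = df.filter($"x" =!= 2)

// Null-safe alternatives: an explicit null check, or the <=> operator,
// which treats null as an ordinary comparable value
val explicit  = df.filter($"x".isNull || $"x" =!= 2)
val nullSafe  = df.filter(!($"x" <=> 2))
```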
-
Creating DataFrames in Spark Scala for Testing with toDF
When testing your data engineering ETL pipelines, it helps enormously to be able to quickly create small DataFrames covering the scenarios you are transforming. Likewise, when production surfaces an unexpected problem, quickly creating a test case for that new situation is highly beneficial. The Spark Scala toDF function, available through spark.implicits._, makes this easy.
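A quick sketch with made-up order data: import the session's implicits, and a `Seq` of tuples becomes a DataFrame, with the `toDF` arguments naming the columns.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("toDF-demo").getOrCreate()
import spark.implicits._

// Each tuple becomes a row; toDF's arguments become the column names
val orders = Seq(
  ("o-1", "widget", 3),
  ("o-2", "gadget", 0)
).toDF("order_id", "product", "quantity")

orders.show()
```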
-
Spark Scala Cache Best Practices
Caching a DataFrame tells Spark to keep it in memory (or on disk) after the first time it's computed. This avoids recomputing the same transformations every time you trigger an action. Used well, it can dramatically speed up your pipelines. Used carelessly, it can eat all your memory and make things slower.
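The core pattern in sketch form, on hypothetical data: cache a DataFrame that feeds more than one action, and unpersist it when you're done so the memory comes back.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("cache-demo").getOrCreate()
import spark.implicits._

val raw = Seq((1, 100.0), (2, -5.0), (3, 42.0)).toDF("id", "amount")

// A transformation reused by two separate actions below
val positive = raw.filter($"amount" > 0)

positive.cache()                     // lazy: nothing is stored yet
val n = positive.count()             // first action computes and populates the cache
val maxRow = positive.agg(max($"amount")).first() // served from the cache, not recomputed

positive.unpersist()                 // release the memory once you're done
```

Only caching DataFrames that are used by multiple actions, and unpersisting them afterwards, is what separates "dramatically faster" from "ate all my memory".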