
Create Spark Map From Columns

Using a MapType column in Spark Scala DataFrames can be helpful, as it provides a flexible logical structure for solving problems such as machine learning feature engineering, data exploration, serialization, data enrichment, and denormalization. Thankfully, building these map columns from data already in a DataFrame is very straightforward.

Let's assume we have some data on people who work in a warehouse that we'd like to structure into a map:

import org.apache.spark.sql.functions.{col, concat_ws, lit, map}
import spark.implicits._

val df = Seq(
  (1, "Alice", "Warehouse Worker", "Logistics", "Day Shift"),
  (2, "Bob", "Warehouse Worker", "Logistics", "Night Shift"),
  (3, "Charlie", "Manager", "Logistics", null)
).toDF("id", "name", "role", "department", "shift")

We can use the map function to easily create the map column:

val df2 = df
  .withColumn("s", map(
    lit("id"), col("id"),
    lit("name"), col("name"),
    lit("role"), col("role"),
    lit("department"), col("department"),
    lit("shift"), col("shift")
  ))

df2.show(false)

The resulting data will be:

// +---+-------+----------------+----------+-----------+-----------------------------------------------------------------------------------------------+
// |id |name   |role            |department|shift      |s                                                                                              |
// +---+-------+----------------+----------+-----------+-----------------------------------------------------------------------------------------------+
// |1  |Alice  |Warehouse Worker|Logistics |Day Shift  |{id -> 1, name -> Alice, role -> Warehouse Worker, department -> Logistics, shift -> Day Shift}|
// |2  |Bob    |Warehouse Worker|Logistics |Night Shift|{id -> 2, name -> Bob, role -> Warehouse Worker, department -> Logistics, shift -> Night Shift}|
// |3  |Charlie|Manager         |Logistics |null       |{id -> 3, name -> Charlie, role -> Manager, department -> Logistics, shift -> null}            |
// +---+-------+----------------+----------+-----------+-----------------------------------------------------------------------------------------------+
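Once the map column exists, individual values can be read back out with element access on the column. A quick sketch against the df2 built above (the alias name is just for illustration):

// Look up a single key; a missing key yields null rather than an error
df2.select(col("s")("role").alias("role_from_map")).show(false)

// getItem is the equivalent explicit form
df2.select(col("s").getItem("shift")).show(false)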

The map function requires 2n arguments to properly create the map: the arguments are interpreted as alternating key/value pairs, so the first argument is a key, the second is its value, the third is the next key, and so on. map is a variadic function, accepting an argument list of arbitrary length as long as it meets the 2n requirement.
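Because map is variadic, the 2n-argument list can also be built programmatically. A minimal sketch, assuming the same df as above, that turns every column of the DataFrame into a key/value pair:

// Build one lit(name)/col(name) pair per column, then splat the
// resulting sequence into map with : _*
val allCols = df.columns.flatMap(c => Seq(lit(c), col(c)))
val dfAll = df.withColumn("s", map(allCols: _*))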

The simple example above uses the exact columns already in the DataFrame. The next logical extension is that the map can be filled with derived columns. This ends up being a common way to enrich a dataset by combining a multitude of columns together. Let's look at another simple example showing this:

val df = Seq(
  (1, "Alice", "111 Robins Ln", "NJ", "07712"),
  (2, "Bob", "222 Hawk Rd", "DE", "19701"),
  (3, "Charlie", "333 Sparoow Ct", "ME", "04401")
).toDF("id", "name", "street", "state", "zip")

val df2 = df
  .withColumn("address", map(
    lit("id"), col("id"),
    lit("name"), col("name"),
    lit("address"), concat_ws(" ", col("street"), col("state"), col("zip"))
  ))

df2.show(false)

// +---+-------+--------------+-----+-----+--------------------------------------------------------------+
// |id |name   |street        |state|zip  |address                                                       |
// +---+-------+--------------+-----+-----+--------------------------------------------------------------+
// |1  |Alice  |111 Robins Ln |NJ   |07712|{id -> 1, name -> Alice, address -> 111 Robins Ln NJ 07712}   |
// |2  |Bob    |222 Hawk Rd   |DE   |19701|{id -> 2, name -> Bob, address -> 222 Hawk Rd DE 19701}       |
// |3  |Charlie|333 Sparrow Ct|ME   |04401|{id -> 3, name -> Charlie, address -> 333 Sparrow Ct ME 04401}|
// +---+-------+--------------+-----+-----+--------------------------------------------------------------+
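Going the other direction is just as straightforward: explode flattens a map column back into key/value rows, which is handy for inspecting a derived map like this one. A short sketch using the df2 from this example:

import org.apache.spark.sql.functions.explode

// Each map entry becomes its own row, with generated key and value columns
df2.select(col("id"), explode(col("address"))).show(false)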

These sorts of Spark Scala data pipeline transformations can get very complex. Using map can help with many different types of processing and allows you to better structure and reason about the data being manipulated.

Example Details

Created: 2023-09-29 12:36:00 PM

Last Updated: 2023-09-29 12:36:00 PM