Spark Scala Hash Functions
Hash functions serve many purposes in data engineering. They can be used to check data integrity, detect duplicate records, support cryptographic security use cases, and balance partition sizes in large data operations, among other things.
At the theoretical level, a hash function is a mathematical algorithm that takes an input and produces a fixed length output known as a hash value. It is deterministic, so the same input always results in the same hash value, and it is designed to be fast to compute.
Hash functions are designed so that computing the hash is easy but going in the other direction, from hash to original input value, is not. Some hash functions are considered cryptographically secure and others are not. This changes over time as growing compute power and new attack techniques continually reshape the landscape of hash functions.
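The two properties above, determinism and fixed output length, can be checked with a few lines of plain Scala. This is only a minimal sketch: it uses the JDK's MessageDigest with SHA-256 rather than Spark, and sha256Hex is a hypothetical helper name introduced for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Hypothetical helper: hex-encoded SHA-256 digest of a string.
def sha256Hex(input: String): String = {
  val digest = MessageDigest.getInstance("SHA-256")
  digest.digest(input.getBytes(StandardCharsets.UTF_8)).map("%02x".format(_)).mkString
}

val shortHash = sha256Hex("a")
val longHash = sha256Hex("a" * 10000)

// Deterministic: hashing the same input again gives the same value.
val sameAgain = sha256Hex("a") == shortHash

// Fixed length: 256 bits is always 64 hex characters, whatever the input size.
val lengths = (shortHash.length, longHash.length)
```

Both inputs produce a digest of the same length even though one input is 10,000 times longer than the other.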
Different Spark Hashing Functions
In Spark Scala there are several hash functions available, each suited to different use cases:
def md5(e: Column): Column
The md5 hash is not considered cryptographically secure and must not be used for security purposes such as password hashing. It is vulnerable to collisions and is therefore considered deprecated for security use. However, it is still useful for checksums and duplicate data detection when security is not a concern.
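As a rough illustration of duplicate detection, the sketch below builds an md5 fingerprint per row and drops rows that share one. The sample data, the column names, and the "||" separator are assumptions made for this example. In spark-shell, where spark already exists, the session creation line can be skipped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat_ws, md5}

val spark = SparkSession.builder().master("local[*]").appName("md5-dedupe").getOrCreate()
import spark.implicits._

// Hypothetical sample data: the first two rows are exact duplicates.
val people = Seq(
  ("Ada", "London"),
  ("Ada", "London"),
  ("Grace", "New York")
).toDF("name", "city")

// Build one fingerprint per row from the columns that define "the same record".
val fingerprinted = people
  .withColumn("fingerprint", md5(concat_ws("||", col("name"), col("city"))))

// Rows with the same fingerprint are treated as duplicates.
val deduped = fingerprinted.dropDuplicates("fingerprint")
val dedupedCount = deduped.count()
```

One caveat with this approach: concat_ws skips null values, so rows that differ only in which column is null can end up with the same fingerprint. Replacing nulls with a placeholder before hashing avoids that.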
def sha1(e: Column): Column
Once widely used, sha1 is no longer considered cryptographically secure and is also considered deprecated for security use. Similar to md5, it can be used in non-cryptographic situations where speed is important and security is not, for example duplicate data detection.
def sha2(e: Column, numBits: Int): Column
SHA-2, as of this writing, is considered cryptographically secure. Multiple bit lengths can be used: 224, 256, 384 or 512 bits. A longer digest is harder to attack, but it may be slower to compute and takes more space to store.
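To make the effect of numBits concrete, the following sketch measures the length of the hex string that sha2 returns for each bit length. Each hex character encodes 4 bits, so the expected lengths are the bit length divided by 4. The local session setup is an assumption and can be skipped in spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, length, sha2}

val spark = SparkSession.builder().master("local[*]").appName("sha2-lengths").getOrCreate()
import spark.implicits._

val df = Seq("The quick brown fox...").toDF("data")

// Pair each supported bit length with the length of the resulting hex digest.
val hexLengths = Seq(224, 256, 384, 512).map { bits =>
  bits -> df.select(length(sha2(col("data"), bits))).head().getInt(0)
}

hexLengths.foreach { case (bits, len) => println(s"sha2 with $bits bits gives $len hex characters") }
```

This matters when sizing a fixed-width column to store the digest: a 512-bit hash needs 128 characters, twice as many as a 256-bit hash.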
def crc32(e: Column): Column
Crc32, or Cyclic Redundancy Check 32, is a hashing algorithm used to detect errors in data. It is not considered cryptographically secure and is really only used for data integrity checks.
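A data integrity check with crc32 might look like the sketch below: a checksum is stored alongside each payload, and later recomputed and compared. The data, the column names, and the simulated corruption are all made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, crc32, lit, when}

val spark = SparkSession.builder().master("local[*]").appName("crc32-check").getOrCreate()
import spark.implicits._

// Hypothetical data as it was sent, with a checksum computed at the source.
val sent = Seq((1, "alpha"), (2, "bravo")).toDF("id", "payload")
  .withColumn("checksum", crc32(col("payload")))

// Simulate corruption in transit: the payload of row 2 changes, the stored checksum does not.
val received = sent
  .withColumn("payload", when(col("id") === 2, lit("bravp")).otherwise(col("payload")))

// Recompute the checksum on arrival and keep the rows where it no longer matches.
val corruptedIds = received
  .filter(crc32(col("payload")) =!= col("checksum"))
  .select("id").as[Int].collect()
```

Only the row whose payload was altered is flagged. The checksum guards against accidental corruption, not deliberate tampering, since anyone who can change the payload can also recompute the checksum.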
def hash(cols: Column*): Column
The hash function is a generic hashing function that uses the Murmur3 algorithm under the hood. It is not cryptographically secure and is designed for speed and efficiency. This is a good option for hashing use cases that are not related to cryptography.
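One common non-cryptographic use is assigning rows to a fixed number of buckets, for example to spread keys evenly. The sketch below is one way to do that under assumed names: numBuckets, user_id and region are hypothetical. Because hash accepts multiple columns, a composite key can be hashed in a single call.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val spark = SparkSession.builder().master("local[*]").appName("hash-buckets").getOrCreate()
import spark.implicits._

val numBuckets = 8 // hypothetical bucket count
val users = (1 to 1000).map(i => (s"user-$i", i % 3)).toDF("user_id", "region")

// hash returns an Int that may be negative, so pmod (positive modulus)
// keeps the bucket number in the range [0, numBuckets).
val bucketed = users
  .withColumn("bucket", pmod(hash(col("user_id"), col("region")), lit(numBuckets)))

val buckets = bucketed.select("bucket").as[Int].distinct().collect().sorted
bucketed.groupBy("bucket").count().orderBy("bucket").show()
```

The per-bucket counts printed at the end give a quick view of how evenly the keys were spread.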
def xxhash64(cols: Column*): Column
The xxhash64 function uses the 64-bit variant of xxHash, an extremely fast hash algorithm, and returns the result as a long. Like hash, it is not cryptographically secure and is meant for hashing use cases where speed and efficiency matter but security does not.
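A typical use for a fast multi-column hash is change detection between two snapshots of the same table. The following is a sketch only: the snapshots, the column names, and the rowHash helper are hypothetical and not part of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, xxhash64}

val spark = SparkSession.builder().master("local[*]").appName("xxhash64-changes").getOrCreate()
import spark.implicits._

// Hypothetical snapshots: the city for id 1 changes between them.
val previous = Seq((1, "Ada", "London"), (2, "Grace", "New York")).toDF("id", "name", "city")
val current = Seq((1, "Ada", "Paris"), (2, "Grace", "New York")).toDF("id", "name", "city")

// Hypothetical helper: keep the key plus one hash over the tracked columns.
def rowHash(df: DataFrame, hashName: String): DataFrame =
  df.select(col("id"), xxhash64(col("name"), col("city")).as(hashName))

// Rows whose hash differs between the snapshots are treated as changed.
val changedIds = rowHash(previous, "hash_before")
  .join(rowHash(current, "hash_after"), "id")
  .filter(col("hash_before") =!= col("hash_after"))
  .select("id").as[Int].collect()
```

Comparing one 64-bit value per row is cheaper than comparing every tracked column. The trade-off is a very small chance that a real change goes unnoticed because of a hash collision.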
Spark Hash Function Examples
Let's see the Spark Scala hash functions in action:
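import org.apache.spark.sql.functions.{col, crc32, hash, md5, sha1, sha2, xxhash64}
import spark.implicits._ // assumes an active SparkSession named spark, as in spark-shell

val df = Seq("The quick brown fox...").toDF("data")

// Apply each hash function to the same input column
val df2 = df
  .withColumn("md5", md5(col("data")))
  .withColumn("sha1", sha1(col("data")))
  .withColumn("sha2", sha2(col("data"), 256))
  .withColumn("crc32", crc32(col("data")))
  .withColumn("hash", hash(col("data")))
  .withColumn("xxhash64", xxhash64(col("data")))

// Pass false so the long hash values are not truncated
df2.show(false)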
// +----------------------+--------------------------------+----------------------------------------+----------------------------------------------------------------+----------+---------+--------------------+
// |data                  |md5                             |sha1                                    |sha2                                                            |crc32     |hash     |xxhash64            |
// +----------------------+--------------------------------+----------------------------------------+----------------------------------------------------------------+----------+---------+--------------------+
// |The quick brown fox...|9336f028b6d8712c67aa49a515d3e1fc|161d9faa371830b8a1b87798af228bd933bfc212|c42006f4f6e9397ed70efd189e277a6daf21b39a09deb1da4ca9dcd302d364c5|3501303675|996037910|-1261469731064246631|
// +----------------------+--------------------------------+----------------------------------------+----------------------------------------------------------------+----------+---------+--------------------+