Rand Spark Scala Functions
Generating random values is a common need when building data ETL pipelines. Random values are useful in machine learning pipelines, for data sampling, and for testing with synthetic data that mimics real data. The rand functions fill this need when working with DataFrames.
The rand function is straightforward to use:
import org.apache.spark.sql.functions.{rand, randn}
import spark.implicits._ // provides toDF; assumes a SparkSession named spark (as in spark-shell)

val df = Seq(
  "r1", "r2", "r3"
).toDF("example")

val df2 = df
  .withColumn("rand_1", rand()) // uniform double in [0.0, 1.0)
  .withColumn("rand_2", rand())

df2.show()
// +-------+-------------------+------------------+
// |example| rand_1| rand_2|
// +-------+-------------------+------------------+
// | r1| 0.6198810034890265| 0.352138634626846|
// | r2|0.26587924055405765|0.7138521143385632|
// | r3| 0.3341393984726404|0.9428287842232883|
// +-------+-------------------+------------------+
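Under the hood, rand() returns doubles drawn uniformly from the range [0.0, 1.0). A quick sanity check over the df2 built above (a minimal sketch) confirms the bounds:

import org.apache.spark.sql.functions.{min, max}

// every generated value should fall inside [0.0, 1.0)
df2.agg(min("rand_1"), max("rand_1"), min("rand_2"), max("rand_2")).show()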
There is also an overloaded version that takes a seed, which lets you build repeatable processes. To see the difference, check out the side-by-side comparison of rand with and without a seed:
val df = Seq(
  "r1", "r2", "r3"
).toDF("example")

val df2 = df
  .withColumn("rand_1", rand())
  .withColumn("rand_2", rand())
  .withColumn("rand_with_seed_1", rand(123)) // same seed, so these two
  .withColumn("rand_with_seed_2", rand(123)) // columns match row for row

df2.show()
// +-------+------------------+-------------------+-------------------+-------------------+
// |example| rand_1| rand_2| rand_with_seed_1| rand_with_seed_2|
// +-------+------------------+-------------------+-------------------+-------------------+
// | r1|0.1915234586138973| 0.5090376242521805|0.15795279750951363|0.15795279750951363|
// | r2|0.9149545207987475| 0.5276911035265909| 0.648787283930924| 0.648787283930924|
// | r3|0.7626290448924116|0.49306198828040804| 0.9529333503403405| 0.9529333503403405|
// +-------+------------------+-------------------+-------------------+-------------------+
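Notice that rand_with_seed_1 and rand_with_seed_2 match row for row: the same seed produces the same sequence of values. This also holds across separate runs, as long as the input data and partitioning are identical. Here is a minimal repeatability sketch (the column names seeded_a and seeded_b are just for illustration):

import org.apache.spark.sql.functions.col

val a = Seq("r1", "r2", "r3").toDF("example").withColumn("seeded_a", rand(42))
val b = Seq("r1", "r2", "r3").toDF("example").withColumn("seeded_b", rand(42))

// expect a count of 0: identical seeds over identical data yield identical values
a.join(b, "example").where(col("seeded_a") =!= col("seeded_b")).count()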
If you need to generate random numbers within a specific range, you can use basic arithmetic operations to dial in the values you need:
val df = Seq(
  "r1", "r2", "r3"
).toDF("example")

val df2 = df
  .withColumn("rand_1", rand() * 100)                // doubles in [0, 100)
  .withColumn("rand_2", (rand() * 1000).cast("int")) // whole numbers in [0, 999]

df2.show()
// +-------+-----------------+------+
// |example| rand_1|rand_2|
// +-------+-----------------+------+
// | r1|82.88237005646506| 718|
// | r2|75.81724887132683| 646|
// | r3|13.38625126584847| 491|
// +-------+-----------------+------+
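More generally, rand() * (max - min) + min shifts the base range [0.0, 1.0) into [min, max). As a quick sketch, the bounds 5 and 15 below are arbitrary example values:

// maps [0.0, 1.0) onto [5.0, 15.0)
val ranged = df.withColumn("between_5_and_15", rand() * (15 - 5) + 5)
ranged.show()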
Spark also provides a randn function that returns values drawn from the standard normal distribution (mean 0, standard deviation 1). It works in the same way as rand:
val df = Seq(
  "r1", "r2", "r3"
).toDF("example")

val df2 = df
  .withColumn("randn", randn())
  .withColumn("randn_with_seed", randn(123))

df2.show()
// +-------+--------------------+-------------------+
// |example| randn| randn_with_seed|
// +-------+--------------------+-------------------+
// | r1|0.038663043824896104|-0.9927550973188447|
// | r2| -0.6674422212920949| 0.4318390370193406|
// | r3| -0.5780742247477361| 0.2508362804439271|
// +-------+--------------------+-------------------+
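Because randn draws from the standard normal distribution, you can scale and shift it to any normal distribution with randn() * stddev + mean. In this sketch, a mean of 100 and a standard deviation of 15 are arbitrary example values:

// a normal distribution with mean 100 and standard deviation 15
val scores = df.withColumn("score", randn(7) * 15 + 100)
scores.show()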