Spark Scala Log, Log2, Log10, Log1p, and Ln
Spark provides a family of logarithm functions: log for the natural log or an arbitrary base, log2 and log10 for the two most common bases, log1p for accurate results near zero, and ln (a SQL function) as an alias for the natural log. They all return a Double column and return null, rather than raising an error, for inputs outside their domain: zero and negative values for log, log2, log10, and ln, and values of -1 or below for log1p.
log
The most general of the bunch, log covers both the natural logarithm and arbitrary-base logarithms depending on which overload you call:
def log(e: Column): Column
def log(columnName: String): Column
def log(base: Double, a: Column): Column
def log(base: Double, columnName: String): Column
With a single argument, log returns the natural logarithm (base e). With a base parameter, it returns the logarithm of the column in that base.
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in spark-shell

val df = Seq(
  1.0,
  2.718281828459045,
  10.0,
  100.0,
  0.5,
).toDF("value")
val df2 = df
  .withColumn("ln", log(col("value")))
df2.show(false)
// +-----------------+-------------------+
// |value            |ln                 |
// +-----------------+-------------------+
// |1.0              |0.0                |
// |2.718281828459045|1.0                |
// |10.0             |2.302585092994046  |
// |100.0            |4.605170185988092  |
// |0.5              |-0.6931471805599453|
// +-----------------+-------------------+
log(1) is 0 and log(e) is 1 — the natural log's defining values. Values between 0 and 1 produce negative results.
Arbitrary Base
Pass a Double as the first argument to compute a logarithm in any base. This is handy when neither base 2 nor base 10 fits your domain — say, when you're working with a base-5 growth rate or a base-12 musical scale:
val df = Seq(
  8.0,
  25.0,
  100.0,
  1024.0,
  1000000.0,
).toDF("value")
val df2 = df
  .withColumn("log_base_5", log(5.0, col("value")))
df2.show(false)
// +---------+------------------+
// |value    |log_base_5        |
// +---------+------------------+
// |8.0      |1.2920296742201791|
// |25.0     |2.0               |
// |100.0    |2.8613531161467867|
// |1024.0   |4.306765580733931 |
// |1000000.0|8.58405934844036  |
// +---------+------------------+
log(5, 25) is exactly 2.0 because 5² = 25. The other results are irrational numbers, so they are shown rounded to double precision.
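For intuition, an arbitrary-base logarithm can be derived from natural logs with the change-of-base identity log_b(x) = ln(x) / ln(b). The sketch below is plain Scala using scala.math, not Spark, and logBase is a hypothetical helper introduced only for illustration. It mirrors the math; it is not a statement about how Spark evaluates log(base, column) internally.

```scala
// Plain Scala, no Spark required.
// logBase is a hypothetical helper: change of base via natural logarithms.
def logBase(base: Double, x: Double): Double =
  math.log(x) / math.log(base)

val log5of25 = logBase(5.0, 25.0) // close to 2.0, since 5² = 25
val log5of8  = logBase(5.0, 8.0)  // close to 1.292, matching the first row above

println(log5of25)
println(log5of8)
```

Because the result comes from dividing two rounded doubles, comparisons against an expected value are safer with a small tolerance than with ==.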
log2 and log10
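log2 and log10 are direct shortcuts for the two most-used bases. They are mathematically equivalent to log(2, x) and log(10, x) but read more naturally and skip the literal; because of floating-point rounding, a dedicated function and the arbitrary-base form can occasionally differ in the last digit: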
def log2(e: Column): Column
def log2(columnName: String): Column
def log10(e: Column): Column
def log10(columnName: String): Column
log2 is commonly used for measuring information content (bits) and for working with powers of two in storage or networking calculations. log10 lines up with orders of magnitude — every increase of 1.0 in the result means the input grew tenfold.
val df = Seq(
  1.0,
  2.0,
  8.0,
  100.0,
  1024.0,
  1000000.0,
).toDF("value")
val df2 = df
  .withColumn("log2", log2(col("value")))
  .withColumn("log10", log10(col("value")))
df2.show(false)
// +---------+------------------+------------------+
// |value    |log2              |log10             |
// +---------+------------------+------------------+
// |1.0      |0.0               |0.0               |
// |2.0      |1.0               |0.3010299956639812|
// |8.0      |3.0               |0.9030899869919435|
// |100.0    |6.643856189774725 |2.0               |
// |1024.0   |10.0              |3.010299956639812 |
// |1000000.0|19.931568569324174|6.0               |
// +---------+------------------+------------------+
Notice the exact integers: log2(1024) is 10.0 because 2¹⁰ = 1024, and log10(1000000) is 6.0 because 10⁶ = 1,000,000.
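As one illustration of the "bits" use case, the sketch below rounds log2 up to find how many bits are needed to index a given number of distinct values. The column names distinct_values and bits_needed are hypothetical, and the local SparkSession is created only so the snippet stands alone; in spark-shell a session named spark already exists.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{ceil, col, log2}

// Local session for illustration only.
val spark = SparkSession.builder().master("local[*]").appName("log2-bits").getOrCreate()
import spark.implicits._

// distinct_values is a hypothetical column: how many distinct codes must be encoded.
val sizes = Seq(2.0, 5.0, 1000.0, 1024.0).toDF("distinct_values")

// ceil(log2(n)) is the smallest whole number of bits that can index n distinct values.
val withBits = sizes.withColumn("bits_needed", ceil(log2(col("distinct_values"))))
withBits.show(false)
```

Since log2(1024) is exactly 10.0 in the table above, 1024 values need 10 bits, and 1000 values round up to 10 bits as well.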
log1p
log1p computes ln(1 + x). The point of having a separate function rather than writing log(lit(1) + col("x")) is numerical precision when x is very small — 1 + x loses significant digits in floating point, but log1p is computed in a way that preserves them:
def log1p(e: Column): Column
def log1p(columnName: String): Column
val df = Seq(
  0.0,
  0.0001,
  0.01,
  1.0,
  10.0,
).toDF("value")
val df2 = df
  .withColumn("log_of_value", log(col("value")))
  .withColumn("log1p", log1p(col("value")))
df2.show(false)
// +------+------------------+--------------------+
// |value |log_of_value      |log1p               |
// +------+------------------+--------------------+
// |0.0   |null              |0.0                 |
// |1.0E-4|-9.210340371976182|9.999500033330834E-5|
// |0.01  |-4.605170185988091|0.009950330853168083|
// |1.0   |0.0               |0.6931471805599453  |
// |10.0  |2.302585092994046 |2.3978952727983707  |
// +------+------------------+--------------------+
A few things stand out. log1p(0) is 0 (because ln(1) = 0) where log(0) is null, which makes log1p a common transform for non-negative data that may contain zeros; keep in mind that it computes ln(1 + x), not ln(x), so the two columns are not interchangeable. For small x, log1p(x) ≈ x, as the row for 0.0001 shows (log1p(0.0001) ≈ 0.0001). This is exactly the property that compounding interest, log-likelihood, and growth-rate calculations rely on when the per-step value is tiny.
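The precision argument is easiest to see outside Spark. The sketch below uses plain Scala's math.log and math.log1p with an input small enough that 1 + x rounds to exactly 1.0 in double precision. The value 1e-17 is chosen purely for illustration.

```scala
// Plain Scala, no Spark required.
val x = 1e-17

// 1.0 + 1e-17 rounds to exactly 1.0 in double precision, so x is lost before log runs.
val naive = math.log(1.0 + x)

// log1p never forms 1 + x explicitly, so the tiny input survives.
val precise = math.log1p(x)

println(naive)   // 0.0
println(precise) // approximately x
```

The naive form silently returns zero, while log1p returns a value that still tracks x, which is the behavior the Spark function is designed to preserve.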
ln
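ln is a SQL function that computes the natural logarithm. In Spark versions before 3.5 it isn't available in the org.apache.spark.sql.functions object (Spark 3.5 added it there), so the examples here call it through expr():
ln(expr) // SQL function, called as expr("ln(...)"), which returns a Column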
It's functionally identical to single-argument log — pick whichever reads better in context. SQL-heavy teams that share queries between languages may prefer ln for consistency with the SQL standard.
val df = Seq(
  1.0,
  2.718281828459045,
  10.0,
  100.0,
).toDF("value")
val df2 = df
  .withColumn("ln_via_expr", expr("ln(value)"))
  .withColumn("log_function", log(col("value")))
df2.show(false)
// +-----------------+-----------------+-----------------+
// |value            |ln_via_expr      |log_function     |
// +-----------------+-----------------+-----------------+
// |1.0              |0.0              |0.0              |
// |2.718281828459045|1.0              |1.0              |
// |10.0             |2.302585092994046|2.302585092994046|
// |100.0            |4.605170185988092|4.605170185988092|
// +-----------------+-----------------+-----------------+
Zero, Negative, and Null Inputs
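All of these functions (except log1p) follow the same rule for inputs outside the domain of the real logarithm: zero and negative numbers return null, not NaN or an exception. log1p applies the same rule with the boundary shifted by one, returning null for inputs of -1 or below. Nulls pass through as nulls.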
val df = Seq(
  Some(10.0),
  Some(0.0),
  Some(-1.0),
  None,
  Some(1.0),
).toDF("value")
val df2 = df
  .withColumn("log", log(col("value")))
df2.show(false)
// +-----+-----------------+
// |value|log              |
// +-----+-----------------+
// |10.0 |2.302585092994046|
// |0.0  |null             |
// |-1.0 |null             |
// |null |null             |
// |1.0  |0.0              |
// +-----+-----------------+
This is friendlier than the Java Math.log behavior (which returns -Infinity for 0.0 and NaN for negatives) — Spark surfaces these as nulls so they flow through downstream nullability handling the same as any other missing value. If you specifically need to keep these rows but flag them, guard the call with a when.
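A guard of that kind might look like the sketch below. It keeps every row, labels why a logarithm is missing, and only calls log on valid inputs. The log_status column and its labels are hypothetical names chosen for this example, and the local SparkSession is created only so the snippet stands alone.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, log, when}

// Local session for illustration only; spark-shell already provides one named spark.
val spark = SparkSession.builder().master("local[*]").appName("log-guard").getOrCreate()
import spark.implicits._

val readings = Seq(Some(10.0), Some(0.0), Some(-1.0), None).toDF("value")

val flagged = readings
  // Distinguish genuinely missing inputs from inputs outside the log domain.
  .withColumn(
    "log_status",
    when(col("value").isNull, "missing")
      .when(col("value") <= 0, "out_of_domain")
      .otherwise("ok"))
  // when without otherwise yields null for every row that fails the condition.
  .withColumn("log", when(col("value") > 0, log(col("value"))))
flagged.show(false)
```

The log column ends up with the same nulls Spark would produce on its own, but the status column records whether each null came from a missing value or from a zero or negative input.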
Related Functions
For roots and powers, see sqrt, cbrt, and pow. The inverse of the natural logarithm is exponentiation, so for exponential decay and growth calculations you'll often pair log with exp, and log1p with its inverse expm1.