object functions
Commonly used functions available for DataFrame operations. Using functions defined here provides a little more compile-time safety, because the compiler checks that the function exists.
Spark also includes more built-in functions that are less common and are not defined here. You can still access them (and all the functions defined here) by using the functions.expr() API and calling them through a SQL expression string. You can find the entire list of functions in the SQL API documentation of your Spark version; see also the latest list.
As an example, isnan is a function that is defined here. You can use isnan(col("myCol")) to invoke the isnan function. This way the programming language's compiler ensures isnan exists and is of the proper form. You can also use expr("isnan(myCol)") to invoke the same function. In this case, Spark itself will ensure isnan exists when it analyzes the query.
regr_count is an example of a function that is built-in but not defined here, because it is less commonly used. To invoke it, use expr("regr_count(yCol, xCol)").
These function APIs usually have methods with a Column signature only, because a Column can support not only Column but also other types such as a native string. The other variants currently exist for historical reasons.
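The two invocation styles described above can be compared side by side. The following is a minimal sketch, assuming a local SparkSession; the object name IsNanExample and the sample data are illustrative only and are not part of the API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, isnan}

object IsNanExample {
  // Counts NaN rows twice: once through the typed function, once through a SQL string.
  def run(): (Long, Long) = {
    // Assumption: a local, single-threaded session is enough for this illustration.
    val spark = SparkSession.builder().master("local[1]").appName("isnan-example").getOrCreate()
    import spark.implicits._

    val df = Seq(1.0, Double.NaN, 3.0).toDF("myCol")

    // Checked by the Scala compiler: isnan must exist and accept a Column.
    val viaFunction = df.filter(isnan(col("myCol"))).count()
    // Checked by Spark only when the query is analyzed.
    val viaExpr = df.filter(expr("isnan(myCol)")).count()

    (viaFunction, viaExpr)
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Both counts should agree, since the two forms resolve to the same built-in function; only the point at which a misspelled name is detected differs.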
 Annotations
 @Stable()
 Since
1.3.0
Value Members

final
def
!=(arg0: Any): Boolean
 Definition Classes
 AnyRef → Any

final
def
##(): Int
 Definition Classes
 AnyRef → Any

final
def
==(arg0: Any): Boolean
 Definition Classes
 AnyRef → Any

def
abs(e: Column): Column
Computes the absolute value of a numeric value.
 Since
1.3.0

def
acos(columnName: String): Column
 returns
inverse cosine of columnName, as if computed by java.lang.Math.acos
 Since
1.4.0

def
acos(e: Column): Column
 returns
inverse cosine of e in radians, as if computed by java.lang.Math.acos
 Since
1.4.0

def
acosh(columnName: String): Column
 returns
inverse hyperbolic cosine of columnName
 Since
3.1.0

def
acosh(e: Column): Column
 returns
inverse hyperbolic cosine of e
 Since
3.1.0

def
add_months(startDate: Column, numMonths: Column): Column
Returns the date that is numMonths after startDate.
 startDate
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 numMonths
A column of the number of months to add to startDate, can be negative to subtract months
 returns
A date, or null if startDate was a string that could not be cast to a date
 Since
3.0.0

def
add_months(startDate: Column, numMonths: Int): Column
Returns the date that is numMonths after startDate.
 startDate
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 numMonths
The number of months to add to startDate, can be negative to subtract months
 returns
A date, or null if startDate was a string that could not be cast to a date
 Since
1.5.0
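As a small illustration of add_months with a positive and a negative month count, here is a sketch assuming a local SparkSession and Spark 3.0 or later; the object name AddMonthsExample and the sample date are illustrative. The results are cast to string only to make them easy to read.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{add_months, col}

object AddMonthsExample {
  // Adds and subtracts one month from 2015-01-31.
  def run(): (String, String) = {
    // Assumption: a local session; any existing session would work as well.
    val spark = SparkSession.builder().master("local[1]").appName("add-months-example").getOrCreate()
    import spark.implicits._

    val df = Seq("2015-01-31").toDF("d").select(col("d").cast("date").as("d"))

    df.select(
      add_months(col("d"), 1).cast("string"),   // one month later
      add_months(col("d"), -1).cast("string")   // a negative count subtracts months
    ).as[(String, String)].head()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Because February 2015 has no 31st day, the forward result lands on the last valid day of that month.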

def
aggregate(expr: Column, initialValue: Column, merge: (Column, Column) ⇒ Column): Column
Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state.
df.select(aggregate(col("i"), lit(0), (acc, x) => acc + x))
 expr
the input array column
 initialValue
the initial value
 merge
(combined_value, input_value) => combined_value, the merge function to merge an input value to the combined_value
 Since
3.0.0

def
aggregate(expr: Column, initialValue: Column, merge: (Column, Column) ⇒ Column, finish: (Column) ⇒ Column): Column
Applies a binary operator to an initial state and all elements in the array, and reduces this to a single state. The final state is converted into the final result by applying a finish function.
df.select(aggregate(col("i"), lit(0), (acc, x) => acc + x, _ * 10))
 expr
the input array column
 initialValue
the initial value
 merge
(combined_value, input_value) => combined_value, the merge function to merge an input value to the combined_value
 finish
combined_value => final_value, the lambda function to convert the combined value of all inputs to final result
 Since
3.0.0
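The two aggregate overloads can be tried on a single array row. This is a sketch only, assuming a local SparkSession and Spark 3.0 or later; the object name AggregateExample and the sample array are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{aggregate, col, lit}

object AggregateExample {
  // Folds the array [1, 2, 3] with and without a finish function.
  def run(): (Int, Int) = {
    val spark = SparkSession.builder().master("local[1]").appName("aggregate-example").getOrCreate()
    import spark.implicits._

    val df = Seq(Seq(1, 2, 3)).toDF("i")

    df.select(
      // merge only: start at 0 and add each element to the accumulator
      aggregate(col("i"), lit(0), (acc, x) => acc + x).as("sum"),
      // merge plus finish: the final state is multiplied by 10
      aggregate(col("i"), lit(0), (acc, x) => acc + x, _ * 10).as("sum_times_10")
    ).as[(Int, Int)].head()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The initial value and the array elements share the same type here (int), which keeps the merge function's result type consistent with the accumulator.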

def
approx_count_distinct(columnName: String, rsd: Double): Column
Aggregate function: returns the approximate number of distinct items in a group.
 rsd
maximum relative standard deviation allowed (default = 0.05)
 Since
2.1.0

def
approx_count_distinct(e: Column, rsd: Double): Column
Aggregate function: returns the approximate number of distinct items in a group.
 rsd
maximum relative standard deviation allowed (default = 0.05)
 Since
2.1.0

def
approx_count_distinct(columnName: String): Column
Aggregate function: returns the approximate number of distinct items in a group.
 Since
2.1.0

def
approx_count_distinct(e: Column): Column
Aggregate function: returns the approximate number of distinct items in a group.
 Since
2.1.0

def
array(colName: String, colNames: String*): Column
Creates a new array column. The input columns must all have the same data type.
 Annotations
 @varargs()
 Since
1.4.0

def
array(cols: Column*): Column
Creates a new array column. The input columns must all have the same data type.
 Annotations
 @varargs()
 Since
1.4.0

def
array_contains(column: Column, value: Any): Column
Returns null if the array is null, true if the array contains value, and false otherwise.
 Since
1.5.0

def
array_distinct(e: Column): Column
Removes duplicate values from the array.
 Since
2.4.0

def
array_except(col1: Column, col2: Column): Column
Returns an array of the elements in the first array but not in the second array, without duplicates. The order of elements in the result is not determined.
 Since
2.4.0

def
array_intersect(col1: Column, col2: Column): Column
Returns an array of the elements in the intersection of the given two arrays, without duplicates.
 Since
2.4.0

def
array_join(column: Column, delimiter: String): Column
Concatenates the elements of column using the delimiter.
 Since
2.4.0

def
array_join(column: Column, delimiter: String, nullReplacement: String): Column
Concatenates the elements of column using the delimiter. Null values are replaced with nullReplacement.
 Since
2.4.0

def
array_max(e: Column): Column
Returns the maximum value in the array. NaN is greater than any non-NaN elements for double/float type. NULL elements are skipped.
 Since
2.4.0

def
array_min(e: Column): Column
Returns the minimum value in the array. NaN is greater than any non-NaN elements for double/float type. NULL elements are skipped.
 Since
2.4.0

def
array_position(column: Column, value: Any): Column
Locates the position of the first occurrence of the value in the given array as long. Returns null if either of the arguments are null.
 Since
2.4.0
 Note
The position is a 1-based index, not zero-based. Returns 0 if the value could not be found in the array.
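The indexing convention in the note is easy to check. A minimal sketch, assuming a local SparkSession; the object name ArrayPositionExample and the sample array are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_position, col}

object ArrayPositionExample {
  // Looks up one value that is present and one that is absent.
  def run(): (Long, Long) = {
    val spark = SparkSession.builder().master("local[1]").appName("array-position-example").getOrCreate()
    import spark.implicits._

    val df = Seq(Seq("a", "b", "c")).toDF("xs")

    df.select(
      array_position(col("xs"), "b"),  // second element, reported with a 1-based index
      array_position(col("xs"), "z")   // not present in the array
    ).as[(Long, Long)].head()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Since 0 is never a valid position, a result of 0 can be used directly as a "not found" marker.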

def
array_remove(column: Column, element: Any): Column
Removes all elements equal to element from the given array.
 Since
2.4.0

def
array_repeat(e: Column, count: Int): Column
Creates an array containing the left argument repeated the number of times given by the right argument.
 Since
2.4.0

def
array_repeat(left: Column, right: Column): Column
Creates an array containing the left argument repeated the number of times given by the right argument.
 Since
2.4.0

def
array_sort(e: Column): Column
Sorts the input array in ascending order. The elements of the input array must be orderable. NaN is greater than any non-NaN elements for double/float type. Null elements will be placed at the end of the returned array.
 Since
2.4.0

def
array_union(col1: Column, col2: Column): Column
Returns an array of the elements in the union of the given two arrays, without duplicates.
 Since
2.4.0

def
arrays_overlap(a1: Column, a2: Column): Column
Returns true if a1 and a2 have at least one non-null element in common. If not and both the arrays are non-empty and any of them contains a null, it returns null. It returns false otherwise.
 Since
2.4.0
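The three possible outcomes of arrays_overlap (true, false, null) can be shown with literal arrays. This is a sketch assuming a local SparkSession; the object name ArraysOverlapExample and the column aliases are illustrative. The third element of the returned tuple reports whether the result was null.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, arrays_overlap, lit}

object ArraysOverlapExample {
  // Returns (overlap of [1,2] and [2,3], overlap of [1,2] and [3,4], whether [1,null] vs [3] is null).
  def run(): (Boolean, Boolean, Boolean) = {
    val spark = SparkSession.builder().master("local[1]").appName("arrays-overlap-example").getOrCreate()

    val row = spark.range(1).select(
      // share the element 2
      arrays_overlap(array(lit(1), lit(2)), array(lit(2), lit(3))).as("common"),
      // no shared element and no nulls
      arrays_overlap(array(lit(1), lit(2)), array(lit(3), lit(4))).as("disjoint"),
      // no shared non-null element, but one array contains a null
      arrays_overlap(array(lit(1), lit(null).cast("int")), array(lit(3))).as("unknown")
    ).head()

    (row.getBoolean(0), row.getBoolean(1), row.isNullAt(2))
  }

  def main(args: Array[String]): Unit = println(run())
}
```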

def
arrays_zip(e: Column*): Column
Returns a merged array of structs in which the Nth struct contains all Nth values of input arrays.
 Annotations
 @varargs()
 Since
2.4.0

final
def
asInstanceOf[T0]: T0
 Definition Classes
 Any

def
asc(columnName: String): Column
Returns a sort expression based on ascending order of the column.
df.sort(asc("dept"), desc("age"))
 Since
1.3.0

def
asc_nulls_first(columnName: String): Column
Returns a sort expression based on ascending order of the column, and null values appear before non-null values.
df.sort(asc_nulls_first("dept"), desc("age"))
 Since
2.1.0

def
asc_nulls_last(columnName: String): Column
Returns a sort expression based on ascending order of the column, and null values appear after non-null values.
df.sort(asc_nulls_last("dept"), desc("age"))
 Since
2.1.0

def
ascii(e: Column): Column
Computes the numeric value of the first character of the string column, and returns the result as an int column.
 Since
1.5.0

def
asin(columnName: String): Column
 returns
inverse sine of columnName, as if computed by java.lang.Math.asin
 Since
1.4.0

def
asin(e: Column): Column
 returns
inverse sine of e in radians, as if computed by java.lang.Math.asin
 Since
1.4.0

def
asinh(columnName: String): Column
 returns
inverse hyperbolic sine of columnName
 Since
3.1.0

def
asinh(e: Column): Column
 returns
inverse hyperbolic sine of e
 Since
3.1.0

def
assert_true(c: Column, e: Column): Column
Returns null if the condition is true; throws an exception with the error message otherwise.
 Since
3.1.0

def
assert_true(c: Column): Column
Returns null if the condition is true, and throws an exception otherwise.
 Since
3.1.0

def
atan(columnName: String): Column
 returns
inverse tangent of columnName, as if computed by java.lang.Math.atan
 Since
1.4.0

def
atan(e: Column): Column
 returns
inverse tangent of e as if computed by java.lang.Math.atan
 Since
1.4.0

def
atan2(yValue: Double, xName: String): Column
 yValue
coordinate on y-axis
 xName
coordinate on x-axis
 returns
the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2
 Since
1.4.0

def
atan2(yValue: Double, x: Column): Column
 yValue
coordinate on y-axis
 x
coordinate on x-axis
 returns
the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2
 Since
1.4.0

def
atan2(yName: String, xValue: Double): Column
 yName
coordinate on y-axis
 xValue
coordinate on x-axis
 returns
the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2
 Since
1.4.0

def
atan2(y: Column, xValue: Double): Column
 y
coordinate on y-axis
 xValue
coordinate on x-axis
 returns
the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2
 Since
1.4.0

def
atan2(yName: String, xName: String): Column
 yName
coordinate on y-axis
 xName
coordinate on x-axis
 returns
the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2
 Since
1.4.0

def
atan2(yName: String, x: Column): Column
 yName
coordinate on y-axis
 x
coordinate on x-axis
 returns
the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2
 Since
1.4.0

def
atan2(y: Column, xName: String): Column
 y
coordinate on y-axis
 xName
coordinate on x-axis
 returns
the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2
 Since
1.4.0

def
atan2(y: Column, x: Column): Column
 y
coordinate on y-axis
 x
coordinate on x-axis
 returns
the theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2
 Since
1.4.0

def
atanh(columnName: String): Column
 returns
inverse hyperbolic tangent of columnName
 Since
3.1.0

def
atanh(e: Column): Column
 returns
inverse hyperbolic tangent of e
 Since
3.1.0

def
avg(columnName: String): Column
Aggregate function: returns the average of the values in a group.
 Since
1.3.0

def
avg(e: Column): Column
Aggregate function: returns the average of the values in a group.
 Since
1.3.0

def
base64(e: Column): Column
Computes the BASE64 encoding of a binary column and returns it as a string column. This is the reverse of unbase64.
 Since
1.5.0

def
bin(columnName: String): Column
An expression that returns the string representation of the binary value of the given long column. For example, bin("12") returns "1100".
 Since
1.5.0

def
bin(e: Column): Column
An expression that returns the string representation of the binary value of the given long column. For example, bin("12") returns "1100".
 Since
1.5.0

def
bit_length(e: Column): Column
Calculates the bit length for the specified string column.
 Since
3.3.0

def
bitwise_not(e: Column): Column
Computes bitwise NOT (~) of a number.
 Since
3.2.0

def
broadcast[T](df: Dataset[T]): Dataset[T]
Marks a DataFrame as small enough for use in broadcast joins.
The following example marks the right DataFrame for broadcast hash join using joinKey.
// left and right are DataFrames
left.join(broadcast(right), "joinKey")
 Since
1.5.0

def
bround(e: Column, scale: Int): Column
Rounds the value of e to scale decimal places with HALF_EVEN round mode if scale is greater than or equal to 0, or at the integral part when scale is less than 0.
 Since
2.0.0

def
bround(e: Column): Column
Returns the value of the column e rounded to 0 decimal places with HALF_EVEN round mode.
 Since
2.0.0
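HALF_EVEN rounding differs from the more familiar HALF_UP rounding only on exact ties. The sketch below contrasts bround with round (another function in this object) on tie values; it assumes a local SparkSession, and the object name BroundExample is illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{bround, lit, round}

object BroundExample {
  // Returns (bround(2.5), bround(3.5), round(2.5)).
  def run(): (Double, Double, Double) = {
    val spark = SparkSession.builder().master("local[1]").appName("bround-example").getOrCreate()
    import spark.implicits._

    spark.range(1).select(
      bround(lit(2.5)),  // tie goes to the even neighbour, 2
      bround(lit(3.5)),  // tie goes to the even neighbour, 4
      round(lit(2.5))    // HALF_UP for comparison
    ).as[(Double, Double, Double)].head()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

Values such as 2.5 and 3.5 are exactly representable as doubles, which is why they make reliable tie examples; decimal fractions like 0.125 at scale 2 are best tested with a decimal type.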

def
bucket(numBuckets: Int, e: Column): Column
A transform for any type that partitions by a hash of the input column.
 Since
3.0.0

def
bucket(numBuckets: Column, e: Column): Column
A transform for any type that partitions by a hash of the input column.
 Since
3.0.0

def
call_udf(udfName: String, cols: Column*): Column
Calls a user-defined function. Example:
import org.apache.spark.sql._
val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
val spark = df.sparkSession
spark.udf.register("simpleUDF", (v: Int) => v * v)
df.select($"id", call_udf("simpleUDF", $"value"))
 Annotations
 @varargs()
 Since
3.2.0

def
cbrt(columnName: String): Column
Computes the cube-root of the given column.
 Since
1.4.0

def
cbrt(e: Column): Column
Computes the cube-root of the given value.
 Since
1.4.0

def
ceil(columnName: String): Column
Computes the ceiling of the given value of e to 0 decimal places.
 Since
1.4.0

def
ceil(e: Column): Column
Computes the ceiling of the given value of e to 0 decimal places.
 Since
1.4.0

def
ceil(e: Column, scale: Column): Column
Computes the ceiling of the given value of e to scale decimal places.
 Since
3.3.0

def
clone(): AnyRef
 Attributes
 protected[lang]
 Definition Classes
 AnyRef
 Annotations
 @throws( ... ) @native()

def
coalesce(e: Column*): Column
Returns the first column that is not null, or null if all inputs are null.
For example, coalesce(a, b, c) will return a if a is not null, or b if a is null and b is not null, or c if both a and b are null but c is not null.
 Annotations
 @varargs()
 Since
1.3.0
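The coalesce(a, b, c) cases described above can be enumerated row by row. A minimal sketch, assuming a local SparkSession; the object name CoalesceExample and the column names a, b, c are illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col}

object CoalesceExample {
  // One row per case: a set, only b and c set, only c set, nothing set.
  def run(): Seq[Option[Int]] = {
    val spark = SparkSession.builder().master("local[1]").appName("coalesce-example").getOrCreate()
    import spark.implicits._

    val df = Seq[(Option[Int], Option[Int], Option[Int])](
      (Some(1), Some(2), Some(3)),
      (None, Some(2), Some(3)),
      (None, None, Some(3)),
      (None, None, None)
    ).toDF("a", "b", "c")

    df.select(coalesce(col("a"), col("b"), col("c")))
      .collect()
      // Map SQL NULL to None so the result is easy to compare.
      .map(r => if (r.isNullAt(0)) None else Some(r.getInt(0)))
      .toSeq
  }

  def main(args: Array[String]): Unit = println(run())
}
```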

def
col(colName: String): Column
Returns a Column based on the given column name.
 Since
1.3.0

def
collect_list(columnName: String): Column
Aggregate function: returns a list of objects with duplicates.
 Since
1.6.0
 Note
The function is nondeterministic because the order of collected results depends on the order of the rows which may be nondeterministic after a shuffle.

def
collect_list(e: Column): Column
Aggregate function: returns a list of objects with duplicates.
 Since
1.6.0
 Note
The function is nondeterministic because the order of collected results depends on the order of the rows which may be nondeterministic after a shuffle.

def
collect_set(columnName: String): Column
Aggregate function: returns a set of objects with duplicate elements eliminated.
 Since
1.6.0
 Note
The function is nondeterministic because the order of collected results depends on the order of the rows which may be nondeterministic after a shuffle.

def
collect_set(e: Column): Column
Aggregate function: returns a set of objects with duplicate elements eliminated.
 Since
1.6.0
 Note
The function is nondeterministic because the order of collected results depends on the order of the rows which may be nondeterministic after a shuffle.

def
column(colName: String): Column
Returns a Column based on the given column name.

def
concat(exprs: Column*): Column
Concatenates multiple input columns together into a single column. The function works with strings, binary and compatible array columns.
 Annotations
 @varargs()
 Since
1.5.0

def
concat_ws(sep: String, exprs: Column*): Column
Concatenates multiple input string columns together into a single string column, using the given separator.
 Annotations
 @varargs()
 Since
1.5.0

def
conv(num: Column, fromBase: Int, toBase: Int): Column
Convert a number in a string column from one base to another.
 Since
1.5.0

def
corr(columnName1: String, columnName2: String): Column
Aggregate function: returns the Pearson Correlation Coefficient for two columns.
 Since
1.6.0

def
corr(column1: Column, column2: Column): Column
Aggregate function: returns the Pearson Correlation Coefficient for two columns.
 Since
1.6.0

def
cos(columnName: String): Column
 columnName
angle in radians
 returns
cosine of the angle, as if computed by java.lang.Math.cos
 Since
1.4.0

def
cos(e: Column): Column
 e
angle in radians
 returns
cosine of the angle, as if computed by java.lang.Math.cos
 Since
1.4.0

def
cosh(columnName: String): Column
 columnName
hyperbolic angle
 returns
hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh
 Since
1.4.0

def
cosh(e: Column): Column
 e
hyperbolic angle
 returns
hyperbolic cosine of the angle, as if computed by java.lang.Math.cosh
 Since
1.4.0

def
cot(e: Column): Column
 e
angle in radians
 returns
cotangent of the angle
 Since
3.3.0

def
count(columnName: String): TypedColumn[Any, Long]
Aggregate function: returns the number of items in a group.
 Since
1.3.0

def
count(e: Column): Column
Aggregate function: returns the number of items in a group.
 Since
1.3.0

def
countDistinct(columnName: String, columnNames: String*): Column
Aggregate function: returns the number of distinct items in a group.
An alias of count_distinct, and it is encouraged to use count_distinct directly.
 Annotations
 @varargs()
 Since
1.3.0

def
countDistinct(expr: Column, exprs: Column*): Column
Aggregate function: returns the number of distinct items in a group.
An alias of count_distinct, and it is encouraged to use count_distinct directly.
 Annotations
 @varargs()
 Since
1.3.0

def
count_distinct(expr: Column, exprs: Column*): Column
Aggregate function: returns the number of distinct items in a group.
 Annotations
 @varargs()
 Since
3.2.0

def
covar_pop(columnName1: String, columnName2: String): Column
Aggregate function: returns the population covariance for two columns.
 Since
2.0.0

def
covar_pop(column1: Column, column2: Column): Column
Aggregate function: returns the population covariance for two columns.
 Since
2.0.0

def
covar_samp(columnName1: String, columnName2: String): Column
Aggregate function: returns the sample covariance for two columns.
 Since
2.0.0

def
covar_samp(column1: Column, column2: Column): Column
Aggregate function: returns the sample covariance for two columns.
 Since
2.0.0

def
crc32(e: Column): Column
Calculates the cyclic redundancy check value (CRC32) of a binary column and returns the value as a bigint.
 Since
1.5.0

def
csc(e: Column): Column
 e
angle in radians
 returns
cosecant of the angle
 Since
3.3.0

def
cume_dist(): Column
Window function: returns the cumulative distribution of values within a window partition, i.e. the fraction of rows that are below the current row.
N = total number of rows in the partition
cumeDist(x) = number of values before (and including) x / N
 Since
1.6.0

def
current_date(): Column
Returns the current date at the start of query evaluation as a date column. All calls of current_date within the same query return the same value.
 Since
1.5.0

def
current_timestamp(): Column
Returns the current timestamp at the start of query evaluation as a timestamp column. All calls of current_timestamp within the same query return the same value.
 Since
1.5.0

def
date_add(start: Column, days: Column): Column
Returns the date that is days days after start.
 start
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 days
A column of the number of days to add to start, can be negative to subtract days
 returns
A date, or null if start was a string that could not be cast to a date
 Since
3.0.0

def
date_add(start: Column, days: Int): Column
Returns the date that is days days after start.
 start
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 days
The number of days to add to start, can be negative to subtract days
 returns
A date, or null if start was a string that could not be cast to a date
 Since
1.5.0

def
date_format(dateExpr: Column, format: String): Column
Converts a date/timestamp/string to a string value in the date format given by the second argument.
See Datetime Patterns for valid date and time format patterns.
 dateExpr
A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 format
A pattern dd.MM.yyyy would return a string like 18.03.1993
 returns
A string, or null if dateExpr was a string that could not be cast to a timestamp
 Since
1.5.0
 Exceptions thrown
IllegalArgumentException if the format pattern is invalid
 Note
Use specialized functions like year whenever possible as they benefit from a specialized implementation.
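The dd.MM.yyyy pattern mentioned for the format parameter can be applied to a literal date. A minimal sketch, assuming a local SparkSession; the object name DateFormatExample is illustrative.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{date_format, lit}

object DateFormatExample {
  // Formats the date 1993-03-18 with the pattern dd.MM.yyyy.
  def run(): String = {
    val spark = SparkSession.builder().master("local[1]").appName("date-format-example").getOrCreate()
    import spark.implicits._

    spark.range(1)
      .select(date_format(lit("1993-03-18").cast("date"), "dd.MM.yyyy"))
      .as[String]
      .head()
  }

  def main(args: Array[String]): Unit = println(run())
}
```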

def
date_sub(start: Column, days: Column): Column
Returns the date that is days days before start.
 start
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 days
A column of the number of days to subtract from start, can be negative to add days
 returns
A date, or null if start was a string that could not be cast to a date
 Since
3.0.0

def
date_sub(start: Column, days: Int): Column
Returns the date that is days days before start.
 start
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 days
The number of days to subtract from start, can be negative to add days
 returns
A date, or null if start was a string that could not be cast to a date
 Since
1.5.0

def
date_trunc(format: String, timestamp: Column): Column
Returns timestamp truncated to the unit specified by the format.
For example, date_trunc("year", "2018-11-19 12:01:19") returns 2018-01-01 00:00:00
 timestamp
A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 returns
A timestamp, or null if timestamp was a string that could not be cast to a timestamp or format was an invalid value
 Since
2.3.0
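Truncation to the year can be tried with a Column argument. This is a sketch assuming a local SparkSession; the object name DateTruncExample is illustrative, and the result is cast to string only for display.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{date_trunc, lit}

object DateTruncExample {
  // Truncates 2018-11-19 12:01:19 to the start of its year.
  def run(): String = {
    val spark = SparkSession.builder().master("local[1]").appName("date-trunc-example").getOrCreate()
    import spark.implicits._

    val ts = lit("2018-11-19 12:01:19").cast("timestamp")

    spark.range(1)
      .select(date_trunc("year", ts).cast("string"))
      .as[String]
      .head()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

The parsing and the string rendering both use the session time zone, so the printed wall-clock value does not depend on where the code runs.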

def
datediff(end: Column, start: Column): Column
Returns the number of days from start to end.
Only considers the date part of the input. For example:
datediff("2018-01-10 00:00:00", "2018-01-09 23:59:59") // returns 1
 end
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 start
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 returns
An integer, or null if either end or start were strings that could not be cast to a date. Negative if end is before start
 Since
1.5.0
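Both the "date part only" rule and the sign convention can be seen in one query. A minimal sketch, assuming a local SparkSession; the object name DateDiffExample is illustrative, and the literals are cast explicitly so the example does not rely on implicit string conversion.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{datediff, lit}

object DateDiffExample {
  // Returns (days between two timestamps one second apart across midnight, days when end precedes start).
  def run(): (Int, Int) = {
    val spark = SparkSession.builder().master("local[1]").appName("datediff-example").getOrCreate()
    import spark.implicits._

    spark.range(1).select(
      // Only the date part counts: one second apart, but on consecutive days.
      datediff(
        lit("2018-01-10 00:00:00").cast("timestamp"),
        lit("2018-01-09 23:59:59").cast("timestamp")),
      // end is before start, so the result is negative.
      datediff(lit("2018-01-09").cast("date"), lit("2018-01-10").cast("date"))
    ).as[(Int, Int)].head()
  }

  def main(args: Array[String]): Unit = println(run())
}
```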

def
dayofmonth(e: Column): Column
Extracts the day of the month as an integer from a given date/timestamp/string.
 returns
An integer, or null if the input was a string that could not be cast to a date
 Since
1.5.0

def
dayofweek(e: Column): Column
Extracts the day of the week as an integer from a given date/timestamp/string. Ranges from 1 for a Sunday through to 7 for a Saturday
 returns
An integer, or null if the input was a string that could not be cast to a date
 Since
2.3.0

def
dayofyear(e: Column): Column
Extracts the day of the year as an integer from a given date/timestamp/string.
 returns
An integer, or null if the input was a string that could not be cast to a date
 Since
1.5.0

def
days(e: Column): Column
A transform for timestamps and dates to partition data into days.
 Since
3.0.0

def
decode(value: Column, charset: String): Column
Decodes the first argument from a binary into a string using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null.
 Since
1.5.0

def
degrees(columnName: String): Column
Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
 columnName
angle in radians
 returns
angle in degrees, as if computed by
java.lang.Math.toDegrees
 Since
2.1.0

def
degrees(e: Column): Column
Converts an angle measured in radians to an approximately equivalent angle measured in degrees.
 e
angle in radians
 returns
angle in degrees, as if computed by java.lang.Math.toDegrees
 Since
2.1.0

def
dense_rank(): Column
Window function: returns the rank of rows within a window partition, without any gaps.
The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. rank, by contrast, skips the positions taken up by the ties, so the person who came next after the three tied competitors would register as coming in fifth.
This is equivalent to the DENSE_RANK function in SQL.
 Since
1.6.0
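The competition example above can be sketched in plain Scala without Spark. This is an illustrative model of the two ranking rules over a single partition ordered by descending score. The object and method names are hypothetical and this is not how Spark evaluates window functions.

```scala
object RankSketch {
  // rank: 1 + the number of strictly higher scores, so ties leave gaps.
  def rank(scores: Seq[Int]): Seq[Int] =
    scores.map(s => scores.count(_ > s) + 1)

  // dense_rank: 1 + the number of distinct strictly higher scores, so no gaps.
  def denseRank(scores: Seq[Int]): Seq[Int] =
    scores.map(s => scores.distinct.count(_ > s) + 1)

  def main(args: Array[String]): Unit = {
    // One winner, three people tied for second, then one more competitor.
    val scores = Seq(100, 90, 90, 90, 80)
    println(rank(scores))      // List(1, 2, 2, 2, 5)
    println(denseRank(scores)) // List(1, 2, 2, 2, 3)
  }
}
```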

def
desc(columnName: String): Column
Returns a sort expression based on the descending order of the column.
df.sort(asc("dept"), desc("age"))
 Since
1.3.0

def
desc_nulls_first(columnName: String): Column
Returns a sort expression based on the descending order of the column, and null values appear before non-null values.
df.sort(asc("dept"), desc_nulls_first("age"))
 Since
2.1.0

def
desc_nulls_last(columnName: String): Column
Returns a sort expression based on the descending order of the column, and null values appear after non-null values.
df.sort(asc("dept"), desc_nulls_last("age"))
 Since
2.1.0

def
element_at(column: Column, value: Any): Column
Returns the element of the array at the given index in value if column is an array. Returns the value for the given key in value if column is a map.
 Since
2.4.0

def
encode(value: Column, charset: String): Column
Encodes the first argument from a string into a binary using the provided character set (one of 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16'). If either argument is null, the result will also be null.
 Since
1.5.0

final
def
eq(arg0: AnyRef): Boolean
 Definition Classes
 AnyRef

def
equals(arg0: Any): Boolean
 Definition Classes
 AnyRef → Any

def
exists(column: Column, f: (Column) ⇒ Column): Column
Returns whether a predicate holds for one or more elements in the array.
df.select(exists(col("i"), _ % 2 === 0))
 column
the input array column
 f
col => predicate, the Boolean predicate to check the input column
 Since
3.0.0

def
exp(columnName: String): Column
Computes the exponential of the given column.
 Since
1.4.0

def
exp(e: Column): Column
Computes the exponential of the given value.
 Since
1.4.0

def
explode(e: Column): Column
Creates a new row for each element in the given array or map column. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise.
 Since
1.3.0

def
explode_outer(e: Column): Column
Creates a new row for each element in the given array or map column. Uses the default column name col for elements in the array and key and value for elements in the map unless specified otherwise. Unlike explode, if the array/map is null or empty then null is produced.
 Since
2.2.0
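The difference between explode and explode_outer for null or empty arrays can be modeled in plain Scala. This is a minimal sketch that represents a nullable array column as an Option. The Row alias and the method names are hypothetical and the code does not use Spark.

```scala
object ExplodeSketch {
  // Each input row is an id plus an array that may be null (modeled as None).
  type Row = (Int, Option[Seq[String]])

  // explode: rows whose array is null or empty produce no output rows.
  def explode(rows: Seq[Row]): Seq[(Int, Option[String])] =
    rows.flatMap { case (id, arr) =>
      arr.getOrElse(Seq.empty).map(x => (id, Option(x)))
    }

  // explode_outer: rows whose array is null or empty produce one row holding null (None).
  def explodeOuter(rows: Seq[Row]): Seq[(Int, Option[String])] =
    rows.flatMap { case (id, arr) =>
      val xs = arr.getOrElse(Seq.empty)
      if (xs.isEmpty) Seq((id, Option.empty[String])) else xs.map(x => (id, Option(x)))
    }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq((1, Some(Seq("a", "b"))), (2, Some(Seq.empty)), (3, None))
    println(explode(rows))      // List((1,Some(a)), (1,Some(b)))
    println(explodeOuter(rows)) // List((1,Some(a)), (1,Some(b)), (2,None), (3,None))
  }
}
```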

def
expm1(columnName: String): Column
Computes the exponential of the given column minus one.
 Since
1.4.0

def
expm1(e: Column): Column
Computes the exponential of the given value minus one.
 Since
1.4.0

def
expr(expr: String): Column
Parses the expression string into the column that it represents, similar to Dataset#selectExpr.
// get the number of words of each length
df.groupBy(expr("length(word)")).count()

def
factorial(e: Column): Column
Computes the factorial of the given value.
 Since
1.5.0

def
filter(column: Column, f: (Column, Column) ⇒ Column): Column
Returns an array of elements for which a predicate holds in a given array.
df.select(filter(col("s"), (x, i) => i % 2 === 0))
 column
the input array column
 f
(col, index) => predicate, the Boolean predicate to filter the input column given the index. Indices start at 0.
 Since
3.0.0

def
filter(column: Column, f: (Column) ⇒ Column): Column
Returns an array of elements for which a predicate holds in a given array.
df.select(filter(col("s"), x => x % 2 === 0))
 column
the input array column
 f
col => predicate, the Boolean predicate to filter the input column
 Since
3.0.0

def
finalize(): Unit
 Attributes
 protected[lang]
 Definition Classes
 AnyRef
 Annotations
 @throws( classOf[java.lang.Throwable] )

def
first(columnName: String): Column
Aggregate function: returns the first value of a column in a group.
The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
 Since
1.3.0
 Note
The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def
first(e: Column): Column
Aggregate function: returns the first value in a group.
The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
 Since
1.3.0
 Note
The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def
first(columnName: String, ignoreNulls: Boolean): Column
Aggregate function: returns the first value of a column in a group.
The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
 Since
2.0.0
 Note
The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def
first(e: Column, ignoreNulls: Boolean): Column
Aggregate function: returns the first value in a group.
The function by default returns the first value it sees. It will return the first non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
 Since
2.0.0
 Note
The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def
flatten(e: Column): Column
Creates a single array from an array of arrays. If a structure of nested arrays is deeper than two levels, only one level of nesting is removed.
 Since
2.4.0

def
floor(columnName: String): Column
Computes the floor of the given column value to 0 decimal places.
 Since
1.4.0

def
floor(e: Column): Column
Computes the floor of the given value of e to 0 decimal places.
 Since
1.4.0

def
floor(e: Column, scale: Column): Column
Computes the floor of the given value of e to scale decimal places.
 Since
3.3.0

def
forall(column: Column, f: (Column) ⇒ Column): Column
Returns whether a predicate holds for every element in the array.
df.select(forall(col("i"), x => x % 2 === 0))
 column
the input array column
 f
col => predicate, the Boolean predicate to check the input column
 Since
3.0.0

def
format_number(x: Column, d: Int): Column
Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places with HALF_EVEN round mode, and returns the result as a string column.
If d is 0, the result has no decimal point or fractional part. If d is less than 0, the result will be null.
 Since
1.5.0
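HALF_EVEN rounding sends a value exactly halfway between two candidates to the one whose last digit is even. The sketch below illustrates that rule together with thousands grouping using java.text.DecimalFormat. The helper name is hypothetical, it only handles d >= 0, and it takes the number as a decimal string so the halfway cases are exact. Whether Spark formats numbers exactly this way is an assumption.

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}
import java.text.{DecimalFormat, DecimalFormatSymbols}
import java.util.Locale

object FormatNumberSketch {
  // Hypothetical helper: groups thousands and rounds to d >= 0 decimal places with HALF_EVEN.
  def format(x: String, d: Int): String = {
    val pattern = if (d == 0) "#,##0" else "#,##0." + "0" * d
    val fmt = new DecimalFormat(pattern, DecimalFormatSymbols.getInstance(Locale.US))
    fmt.setRoundingMode(RoundingMode.HALF_EVEN)
    fmt.format(new JBigDecimal(x))
  }

  def main(args: Array[String]): Unit = {
    println(format("1234567.125", 2)) // 1,234,567.12 (tie goes to the even digit 2)
    println(format("1234567.135", 2)) // 1,234,567.14 (tie goes to the even digit 4)
    println(format("1234.5", 0))      // 1,234 (no decimal point when d is 0)
  }
}
```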

def
format_string(format: String, arguments: Column*): Column
Formats the arguments in printf-style and returns the result as a string column.
 Annotations
 @varargs()
 Since
1.5.0

def
from_csv(e: Column, schema: Column, options: Map[String, String]): Column
(Java-specific) Parses a column containing a CSV string into a StructType with the specified schema. Returns null in the case of an unparseable string.
 e
a string column containing CSV data.
 schema
the schema to use when parsing the CSV string
 options
options to control how the CSV is parsed. Accepts the same options as the CSV data source. See Data Source Option in the version you use.
 Since
3.0.0

def
from_csv(e: Column, schema: StructType, options: Map[String, String]): Column
Parses a column containing a CSV string into a StructType with the specified schema. Returns null in the case of an unparseable string.
 e
a string column containing CSV data.
 schema
the schema to use when parsing the CSV string
 options
options to control how the CSV is parsed. Accepts the same options as the CSV data source. See Data Source Option in the version you use.
 Since
3.0.0

def
from_json(e: Column, schema: Column, options: Map[String, String]): Column
(Java-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType of StructTypes with the specified schema. Returns null in the case of an unparseable string.
 e
a string column containing JSON data.
 schema
the schema to use when parsing the json string
 options
options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.
 Since
2.4.0

def
from_json(e: Column, schema: Column): Column
(Scala-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType of StructTypes with the specified schema. Returns null in the case of an unparseable string.
 e
a string column containing JSON data.
 schema
the schema to use when parsing the json string
 Since
2.4.0

def
from_json(e: Column, schema: String, options: Map[String, String]): Column
(Scala-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null in the case of an unparseable string.
 e
a string column containing JSON data.
 schema
the schema as a DDL-formatted string.
 options
options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.
 Since
2.3.0

def
from_json(e: Column, schema: String, options: Map[String, String]): Column
(Java-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null in the case of an unparseable string.
 e
a string column containing JSON data.
 schema
the schema as a DDL-formatted string.
 options
options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.
 Since
2.1.0

def
from_json(e: Column, schema: DataType): Column
Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null in the case of an unparseable string.
 e
a string column containing JSON data.
 schema
the schema to use when parsing the json string
 Since
2.2.0

def
from_json(e: Column, schema: StructType): Column
Parses a column containing a JSON string into a StructType with the specified schema. Returns null in the case of an unparseable string.
 e
a string column containing JSON data.
 schema
the schema to use when parsing the json string
 Since
2.1.0

def
from_json(e: Column, schema: DataType, options: Map[String, String]): Column
(Java-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null in the case of an unparseable string.
 e
a string column containing JSON data.
 schema
the schema to use when parsing the json string
 options
options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.
 Since
2.2.0

def
from_json(e: Column, schema: StructType, options: Map[String, String]): Column
(Java-specific) Parses a column containing a JSON string into a StructType with the specified schema. Returns null in the case of an unparseable string.
 e
a string column containing JSON data.
 schema
the schema to use when parsing the json string
 options
options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.
 Since
2.1.0

def
from_json(e: Column, schema: DataType, options: Map[String, String]): Column
(Scala-specific) Parses a column containing a JSON string into a MapType with StringType as keys type, StructType or ArrayType with the specified schema. Returns null in the case of an unparseable string.
 e
a string column containing JSON data.
 schema
the schema to use when parsing the json string
 options
options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.
 Since
2.2.0

def
from_json(e: Column, schema: StructType, options: Map[String, String]): Column
(Scala-specific) Parses a column containing a JSON string into a StructType with the specified schema. Returns null in the case of an unparseable string.
 e
a string column containing JSON data.
 schema
the schema to use when parsing the json string
 options
options to control how the json is parsed. Accepts the same options as the json data source. See Data Source Option in the version you use.
 Since
2.1.0

def
from_unixtime(ut: Column, f: String): Column
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the given format.
See Datetime Patterns for valid date and time format patterns
 ut
A number of a type that is castable to a long, such as string or integer. Can be negative for timestamps before the unix epoch
 f
A date time pattern that the input will be formatted to
 returns
A string, or null if ut was a string that could not be cast to a long or f was an invalid date time pattern
 Since
1.5.0

def
from_unixtime(ut: Column): Column
Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format.
 ut
A number of a type that is castable to a long, such as string or integer. Can be negative for timestamps before the unix epoch
 returns
A string, or null if the input was a string that could not be cast to a long
 Since
1.5.0
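The conversion from epoch seconds to a formatted string can be modeled with java.time. In this minimal sketch the helper name is hypothetical and the time zone is passed explicitly so the output is reproducible, whereas the function documented above uses the current system time zone. It illustrates the semantics only and is not Spark's implementation.

```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

object FromUnixtimeSketch {
  // Hypothetical helper: renders epoch seconds in the given zone with the given pattern.
  def render(seconds: Long, pattern: String, zone: ZoneId): String =
    DateTimeFormatter.ofPattern(pattern).withZone(zone).format(Instant.ofEpochSecond(seconds))

  def main(args: Array[String]): Unit = {
    val default = "yyyy-MM-dd HH:mm:ss"
    println(render(0L, default, ZoneId.of("UTC")))    // 1970-01-01 00:00:00
    println(render(0L, default, ZoneId.of("+01:00"))) // 1970-01-01 01:00:00
    println(render(-1L, default, ZoneId.of("UTC")))   // 1969-12-31 23:59:59 (before the epoch)
  }
}
```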

def
from_utc_timestamp(ts: Column, tz: Column): Column
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'.
 Since
2.4.0

def
from_utc_timestamp(ts: Column, tz: String): Column
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in UTC, and renders that time as a timestamp in the given time zone. For example, 'GMT+1' would yield '2017-07-14 03:40:00.0'.
 ts
A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 tz
A string detailing the time zone ID that the input should be adjusted to. It should be in the format of either region-based zone IDs or zone offsets. Region IDs must have the form 'area/city', such as 'America/Los_Angeles'. Zone offsets must be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'. Other short names are not recommended because they can be ambiguous.
 returns
A timestamp, or null if ts was a string that could not be cast to a timestamp or tz was an invalid value
 Since
1.5.0

final
def
getClass(): Class[_]
 Definition Classes
 AnyRef → Any
 Annotations
 @native()

def
get_json_object(e: Column, path: String): Column
Extracts json object from a json string based on json path specified, and returns json string of the extracted json object. It will return null if the input json string is invalid.
 Since
1.6.0

def
greatest(columnName: String, columnNames: String*): Column
Returns the greatest value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
 Annotations
 @varargs()
 Since
1.5.0

def
greatest(exprs: Column*): Column
Returns the greatest value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
 Annotations
 @varargs()
 Since
1.5.0

def
grouping(columnName: String): Column
Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
 Since
2.0.0

def
grouping(e: Column): Column
Aggregate function: indicates whether a specified column in a GROUP BY list is aggregated or not, returns 1 for aggregated or 0 for not aggregated in the result set.
 Since
2.0.0

def
grouping_id(colName: String, colNames: String*): Column
Aggregate function: returns the level of grouping, equals to
(grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)
 Since
2.0.0
 Note
The list of columns should match with grouping columns exactly.

def
grouping_id(cols: Column*): Column
Aggregate function: returns the level of grouping, equals to
(grouping(c1) << (n-1)) + (grouping(c2) << (n-2)) + ... + grouping(cn)
 Since
2.0.0
 Note
The list of columns should match with grouping columns exactly, or empty (means all the grouping columns).
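The bit formula for grouping_id can be worked through in plain Scala. This is a minimal sketch in which the object and method names are hypothetical and the input is the sequence of grouping(c1), ..., grouping(cn) values (each 0 or 1) for one result row.

```scala
object GroupingIdSketch {
  // bits(i) holds grouping(c_{i+1}): 1 if that column is aggregated, 0 otherwise.
  // The first column lands in the most significant bit (shift n-1), the last in bit 0.
  def groupingId(bits: Seq[Int]): Long = {
    val n = bits.length
    bits.zipWithIndex.map { case (g, i) => g.toLong << (n - 1 - i) }.sum
  }

  def main(args: Array[String]): Unit = {
    println(groupingId(Seq(0, 0, 0))) // 0: no column aggregated
    println(groupingId(Seq(0, 0, 1))) // 1: only c3 aggregated
    println(groupingId(Seq(1, 0, 1))) // 5: c1 and c3 aggregated (binary 101)
    println(groupingId(Seq(1, 1, 1))) // 7: all three aggregated
  }
}
```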

def
hash(cols: Column*): Column
Calculates the hash code of given columns, and returns the result as an int column.
 Annotations
 @varargs()
 Since
2.0.0

def
hashCode(): Int
 Definition Classes
 AnyRef → Any
 Annotations
 @native()

def
hex(column: Column): Column
Computes hex value of the given column.
 Since
1.5.0

def
hour(e: Column): Column
Extracts the hours as an integer from a given date/timestamp/string.
 returns
An integer, or null if the input was a string that could not be cast to a date
 Since
1.5.0

def
hours(e: Column): Column
A transform for timestamps to partition data into hours.
 Since
3.0.0

def
hypot(l: Double, rightName: String): Column
Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
 Since
1.4.0

def
hypot(l: Double, r: Column): Column
Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
 Since
1.4.0

def
hypot(leftName: String, r: Double): Column
Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
 Since
1.4.0

def
hypot(l: Column, r: Double): Column
Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
 Since
1.4.0

def
hypot(leftName: String, rightName: String): Column
Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
 Since
1.4.0

def
hypot(leftName: String, r: Column): Column
Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
 Since
1.4.0

def
hypot(l: Column, rightName: String): Column
Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
 Since
1.4.0

def
hypot(l: Column, r: Column): Column
Computes sqrt(a^2 + b^2) without intermediate overflow or underflow.
 Since
1.4.0
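To see why "without intermediate overflow or underflow" matters, compare the naive formula with scala.math.hypot (a thin wrapper over java.lang.Math.hypot), which makes the same guarantee. This is a minimal sketch on plain doubles. The object and method names are hypothetical, and that Spark's hypot behaves like Math.hypot for these inputs is an assumption.

```scala
object HypotSketch {
  // Naive formula: the squares can overflow to Infinity or underflow to 0
  // even when the final result is representable.
  def naive(a: Double, b: Double): Double = math.sqrt(a * a + b * b)

  def main(args: Array[String]): Unit = {
    println(naive(3e200, 4e200))        // Infinity, because 9e400 exceeds Double.MaxValue
    println(math.hypot(3e200, 4e200))   // approximately 5.0E200
    println(naive(3e-200, 4e-200))      // 0.0, because the squares underflow
    println(math.hypot(3e-200, 4e-200)) // approximately 5.0E-200
  }
}
```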

def
initcap(e: Column): Column
Returns a new string column by converting the first letter of each word to uppercase. Words are delimited by whitespace.
For example, "hello world" will become "Hello World".
 Since
1.5.0

def
input_file_name(): Column
Creates a string column for the file name of the current Spark task.
 Since
1.6.0

def
instr(str: Column, substring: String): Column
Locates the position of the first occurrence of substring in the given string column. Returns null if either of the arguments is null.
 Since
1.5.0
 Note
The position is a 1-based index, not zero-based. Returns 0 if substring could not be found in str.
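The 1-based convention lines up with String.indexOf shifted by one. This is a minimal plain-Scala sketch of the position rule only. The object and method names are hypothetical and null handling is not modeled.

```scala
object InstrSketch {
  // String.indexOf is 0-based and returns -1 when the substring is absent,
  // so adding 1 gives a 1-based position with 0 meaning "not found".
  def instr(str: String, substring: String): Int = str.indexOf(substring) + 1

  def main(args: Array[String]): Unit = {
    println(instr("hello world", "o"))   // 5
    println(instr("hello world", "h"))   // 1
    println(instr("hello world", "xyz")) // 0
  }
}
```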

final
def
isInstanceOf[T0]: Boolean
 Definition Classes
 Any

def
isnan(e: Column): Column
Return true iff the column is NaN.
 Since
1.6.0

def
isnull(e: Column): Column
Return true iff the column is null.
 Since
1.6.0

def
json_tuple(json: Column, fields: String*): Column
Creates a new row for a json column according to the given field names.
 Annotations
 @varargs()
 Since
1.6.0

def
kurtosis(columnName: String): Column
Aggregate function: returns the kurtosis of the values in a group.
 Since
1.6.0

def
kurtosis(e: Column): Column
Aggregate function: returns the kurtosis of the values in a group.
 Since
1.6.0

def
lag(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column
Window function: returns the value that is offset rows before the current row, and defaultValue if there are fewer than offset rows before the current row. ignoreNulls determines whether null values of row are included in or eliminated from the calculation. For example, an offset of one will return the previous row at any given point in the window partition.
This is equivalent to the LAG function in SQL.
 Since
3.2.0

def
lag(e: Column, offset: Int, defaultValue: Any): Column
Window function: returns the value that is offset rows before the current row, and defaultValue if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.
This is equivalent to the LAG function in SQL.
 Since
1.4.0

def
lag(columnName: String, offset: Int, defaultValue: Any): Column
Window function: returns the value that is offset rows before the current row, and defaultValue if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.
This is equivalent to the LAG function in SQL.
 Since
1.4.0

def
lag(columnName: String, offset: Int): Column
Window function: returns the value that is offset rows before the current row, and null if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.
This is equivalent to the LAG function in SQL.
 Since
1.4.0

def
lag(e: Column, offset: Int): Column
Window function: returns the value that is offset rows before the current row, and null if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.
This is equivalent to the LAG function in SQL.
 Since
1.4.0
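The lag semantics can be modeled over an already ordered window partition in plain Scala. This is a minimal sketch in which the object and method names are hypothetical, a missing value is represented as None rather than null, offset is assumed to be non-negative, and ignoreNulls is not modeled.

```scala
object LagSketch {
  // For each position i, returns the value offset rows earlier, or the default
  // when fewer than offset rows precede position i.
  def lag[A](rows: Seq[A], offset: Int, default: Option[A] = None): List[Option[A]] =
    rows.indices.map(i => if (i - offset >= 0) Some(rows(i - offset)) else default).toList

  def main(args: Array[String]): Unit = {
    val partition = Seq(10, 20, 30)
    println(lag(partition, 1))          // List(None, Some(10), Some(20))
    println(lag(partition, 2, Some(0))) // List(Some(0), Some(0), Some(10))
  }
}
```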

def
last(columnName: String): Column
Aggregate function: returns the last value of the column in a group.
The function by default returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
 Since
1.3.0
 Note
The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def
last(e: Column): Column
Aggregate function: returns the last value in a group.
The function by default returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
 Since
1.3.0
 Note
The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def
last(columnName: String, ignoreNulls: Boolean): Column
Aggregate function: returns the last value of the column in a group.
The function by default returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
 Since
2.0.0
 Note
The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def
last(e: Column, ignoreNulls: Boolean): Column
Aggregate function: returns the last value in a group.
The function by default returns the last value it sees. It will return the last non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
 Since
2.0.0
 Note
The function is non-deterministic because its result depends on the order of the rows, which may be non-deterministic after a shuffle.

def
last_day(e: Column): Column
Returns the last day of the month which the given date belongs to. For example, input "2015-07-27" returns "2015-07-31" since July 31 is the last day of the month in July 2015.
 e
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 returns
A date, or null if the input was a string that could not be cast to a date
 Since
1.5.0
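
The July 2015 example above can be reproduced outside Spark with `java.time`. This is only a sketch of the documented result for a valid date; the `LastDaySketch` object and its `lastDay` helper are hypothetical names, not part of Spark.

```scala
import java.time.LocalDate
import java.time.temporal.TemporalAdjusters

object LastDaySketch {
  // Hypothetical helper: moves a date to the last day of its month.
  def lastDay(date: LocalDate): LocalDate =
    date.`with`(TemporalAdjusters.lastDayOfMonth())
}
```

For instance, `LastDaySketch.lastDay(LocalDate.parse("2015-07-27"))` yields `2015-07-31`.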

def
lead(e: Column, offset: Int, defaultValue: Any, ignoreNulls: Boolean): Column
Window function: returns the value that is offset rows after the current row, and defaultValue if there are fewer than offset rows after the current row. ignoreNulls determines whether null values are included in or eliminated from the calculation. The default value of ignoreNulls is false. For example, an offset of one will return the next row at any given point in the window partition.
This is equivalent to the LEAD function in SQL.
 Since
3.2.0

def
lead(e: Column, offset: Int, defaultValue: Any): Column
Window function: returns the value that is offset rows after the current row, and defaultValue if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.
This is equivalent to the LEAD function in SQL.
 Since
1.4.0

def
lead(columnName: String, offset: Int, defaultValue: Any): Column
Window function: returns the value that is offset rows after the current row, and defaultValue if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.
This is equivalent to the LEAD function in SQL.
 Since
1.4.0

def
lead(e: Column, offset: Int): Column
Window function: returns the value that is offset rows after the current row, and null if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.
This is equivalent to the LEAD function in SQL.
 Since
1.4.0

def
lead(columnName: String, offset: Int): Column
Window function: returns the value that is offset rows after the current row, and null if there are fewer than offset rows after the current row. For example, an offset of one will return the next row at any given point in the window partition.
This is equivalent to the LEAD function in SQL.
 Since
1.4.0
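
The offset and default behaviour of the lead variants can be modelled on a plain Scala sequence. This is a minimal sketch, not Spark's implementation: `LeadSketch` is a hypothetical name, `rows` stands for one window partition already in window order, and `None` stands in for SQL null.

```scala
object LeadSketch {
  // For each row i, look `offset` rows ahead; fall back to the default
  // when that position lies outside the partition.
  def lead[A](rows: Seq[A], offset: Int, defaultValue: Option[A] = None): Seq[Option[A]] =
    rows.indices.map { i =>
      val j = i + offset
      if (j >= 0 && j < rows.length) Some(rows(j)) else defaultValue
    }
}
```

With `rows = Seq(10, 20, 30)` and an offset of one, each row sees its successor and the last row receives the default.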

def
least(columnName: String, columnNames: String*): Column
Returns the least value of the list of column names, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
 Annotations
 @varargs()
 Since
1.5.0

def
least(exprs: Column*): Column
Returns the least value of the list of values, skipping null values. This function takes at least 2 parameters. It will return null iff all parameters are null.
 Annotations
 @varargs()
 Since
1.5.0
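
The null-skipping rule for least can be illustrated without Spark. In this sketch, `LeastSketch` is a hypothetical name and `Option[Int]` models a nullable integer column value, with `None` standing in for null.

```scala
object LeastSketch {
  // Returns the smallest non-null value, or None only when every input is null.
  def least(values: Option[Int]*): Option[Int] = {
    require(values.length >= 2, "least takes at least 2 parameters")
    val nonNull = values.flatten
    if (nonNull.isEmpty) None else Some(nonNull.min)
  }
}
```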

def
length(e: Column): Column
Computes the character length of a given string or number of bytes of a binary string. The length of character strings includes the trailing spaces. The length of binary strings includes binary zeros.
 Since
1.5.0

def
levenshtein(l: Column, r: Column): Column
Computes the Levenshtein distance of the two given string columns.
 Since
1.5.0

def
lit(literal: Any): Column
Creates a Column of literal value.

def
localtimestamp(): Column
Returns the current timestamp without time zone at the start of query evaluation as a timestamp without time zone column. All calls of localtimestamp within the same query return the same value.
 Since
3.3.0

def
locate(substr: String, str: Column, pos: Int): Column
Locate the position of the first occurrence of substr in a string column, after position pos.
 Since
1.5.0
 Note
The position is 1-based, not zero-based. Returns 0 if substr could not be found in str.

def
locate(substr: String, str: Column): Column
Locate the position of the first occurrence of substr.
 Since
1.5.0
 Note
The position is 1-based, not zero-based. Returns 0 if substr could not be found in str.
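
The 1-based convention and the 0 result for a missing substring can be modelled with `String.indexOf`. This is a sketch only: `LocateSketch` is a hypothetical name, and it assumes `pos >= 1` and plain ASCII input.

```scala
object LocateSketch {
  // indexOf is zero-based and returns -1 when nothing is found,
  // so shifting by one gives 1-based positions and 0 for "not found".
  def locate(substr: String, str: String, pos: Int = 1): Int =
    str.indexOf(substr, pos - 1) + 1
}
```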

def
log(base: Double, columnName: String): Column
Returns the first argument-base logarithm of the second argument.
 Since
1.4.0

def
log(base: Double, a: Column): Column
Returns the first argument-base logarithm of the second argument.
 Since
1.4.0

def
log(columnName: String): Column
Computes the natural logarithm of the given column.
 Since
1.4.0

def
log(e: Column): Column
Computes the natural logarithm of the given value.
 Since
1.4.0

def
log10(columnName: String): Column
Computes the logarithm of the given value in base 10.
 Since
1.4.0

def
log10(e: Column): Column
Computes the logarithm of the given value in base 10.
 Since
1.4.0

def
log1p(columnName: String): Column
Computes the natural logarithm of the given column plus one.
 Since
1.4.0

def
log1p(e: Column): Column
Computes the natural logarithm of the given value plus one.
 Since
1.4.0

def
log2(columnName: String): Column
Computes the logarithm of the given value in base 2.
 Since
1.5.0

def
log2(expr: Column): Column
Computes the logarithm of the given column in base 2.
 Since
1.5.0

def
lower(e: Column): Column
Converts a string column to lower case.
 Since
1.3.0

def
lpad(str: Column, len: Int, pad: Array[Byte]): Column
Left-pad the binary column with pad to a byte length of len. If the binary column is longer than len, the return value is shortened to len bytes.
 Since
3.3.0

def
lpad(str: Column, len: Int, pad: String): Column
Left-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
 Since
1.5.0
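
The padding and shortening rules for the string variant can be sketched in plain Scala. `LpadSketch` is a hypothetical name; the handling of an empty pad (return the input, shortened if needed) is an assumption of this sketch rather than something stated above.

```scala
object LpadSketch {
  def lpad(str: String, len: Int, pad: String): String =
    if (str.length >= len || pad.isEmpty) str.take(math.max(len, 0))
    else {
      val needed = len - str.length
      // Repeat the pad enough times, then cut it to exactly the missing length.
      (pad * (needed / pad.length + 1)).take(needed) + str
    }
}
```

Here `lpad("hi", 5, "ab")` gives `"abahi"`, and `lpad("hello", 3, "ab")` is shortened to `"hel"`.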

def
ltrim(e: Column, trimString: String): Column
Trim the specified character string from left end for the specified string column.
 Since
2.3.0

def
ltrim(e: Column): Column
Trim the spaces from left end for the specified string value.
 Since
1.5.0

def
make_date(year: Column, month: Column, day: Column): Column
 returns
A date created from year, month and day fields.
 Since
3.3.0

def
map(cols: Column*): Column
Creates a new map column. The input columns must be grouped as key-value pairs, e.g. (key1, value1, key2, value2, ...). The key columns must all have the same data type, and can't be null. The value columns must all have the same data type.
 Annotations
 @varargs()
 Since
2.0

def
map_concat(cols: Column*): Column
Returns the union of all the given maps.
 Annotations
 @varargs()
 Since
2.4.0

def
map_contains_key(column: Column, key: Any): Column
Returns true if the map contains the key.
 Since
3.3.0

def
map_entries(e: Column): Column
Returns an unordered array of all entries in the given map.
 Since
3.0.0

def
map_filter(expr: Column, f: (Column, Column) ⇒ Column): Column
Returns a map whose key-value pairs satisfy a predicate.
df.select(map_filter(col("m"), (k, v) => k * 10 === v))
 expr
the input map column
 f
(key, value) => predicate, the Boolean predicate to filter the input map column
 Since
3.0.0

def
map_from_arrays(keys: Column, values: Column): Column
Creates a new map column. The array in the first column is used for keys. The array in the second column is used for values. All elements in the array for key should not be null.
 Since
2.4

def
map_from_entries(e: Column): Column
Returns a map created from the given array of entries.
 Since
2.4.0

def
map_keys(e: Column): Column
Returns an unordered array containing the keys of the map.
 Since
2.3.0

def
map_values(e: Column): Column
Returns an unordered array containing the values of the map.
 Since
2.3.0

def
map_zip_with(left: Column, right: Column, f: (Column, Column, Column) ⇒ Column): Column
Merge two given maps, key-wise, into a single map using a function.
df.select(map_zip_with(df("m1"), df("m2"), (k, v1, v2) => k === v1 + v2))
 left
the left input map column
 right
the right input map column
 f
(key, value1, value2) => new_value, the lambda function to merge the map values
 Since
3.0.0

def
max(columnName: String): Column
Aggregate function: returns the maximum value of the column in a group.
 Since
1.3.0

def
max(e: Column): Column
Aggregate function: returns the maximum value of the expression in a group.
 Since
1.3.0

def
max_by(e: Column, ord: Column): Column
Aggregate function: returns the value associated with the maximum value of ord.
 Since
3.3.0

def
md5(e: Column): Column
Calculates the MD5 digest of a binary column and returns the value as a 32-character hex string.
 Since
1.5.0

def
mean(columnName: String): Column
Aggregate function: returns the average of the values in a group. Alias for avg.
 Since
1.4.0

def
mean(e: Column): Column
Aggregate function: returns the average of the values in a group. Alias for avg.
 Since
1.4.0

def
min(columnName: String): Column
Aggregate function: returns the minimum value of the column in a group.
 Since
1.3.0

def
min(e: Column): Column
Aggregate function: returns the minimum value of the expression in a group.
 Since
1.3.0

def
min_by(e: Column, ord: Column): Column
Aggregate function: returns the value associated with the minimum value of ord.
 Since
3.3.0

def
minute(e: Column): Column
Extracts the minutes as an integer from a given date/timestamp/string.
 returns
An integer, or null if the input was a string that could not be cast to a date
 Since
1.5.0

def
monotonically_increasing_id(): Column
A column expression that generates monotonically increasing 64-bit integers.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has fewer than 1 billion partitions, and each partition has fewer than 8 billion records.
As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
 Since
1.6.0
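
The bit layout described above can be checked with a few lines of plain Scala. This is a sketch of the arithmetic only; `MonotonicIdSketch` and `idFor` are hypothetical names, not Spark APIs.

```scala
object MonotonicIdSketch {
  // Partition ID goes in the upper bits (shifted past the 33 record-number bits),
  // and the record number within the partition fills the lower 33 bits.
  def idFor(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << 33) + recordNumber
}
```

Two partitions with three records each give 0, 1, 2 for partition 0 and 8589934592, 8589934593, 8589934594 for partition 1, matching the example.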

def
month(e: Column): Column
Extracts the month as an integer from a given date/timestamp/string.
 returns
An integer, or null if the input was a string that could not be cast to a date
 Since
1.5.0

def
months(e: Column): Column
A transform for timestamps and dates to partition data into months.
 Since
3.0.0

def
months_between(end: Column, start: Column, roundOff: Boolean): Column
Returns number of months between dates end and start. If roundOff is set to true, the result is rounded off to 8 digits; it is not rounded otherwise.
 Since
2.4.0

def
months_between(end: Column, start: Column): Column
Returns number of months between dates start and end.
A whole number is returned if both inputs have the same day of month or both are the last day of their respective months. Otherwise, the difference is calculated assuming 31 days per month.
For example:
months_between("2017-11-14", "2017-07-14") // returns 4.0
months_between("2017-01-01", "2017-01-10") // returns -0.29032258
months_between("2017-06-01", "2017-06-16 12:00:00") // returns -0.5
 end
A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 start
A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 returns
A double, or null if either end or start were strings that could not be cast to a timestamp. Negative if end is before start
 Since
1.5.0
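
The whole-month rule and the 31-days-per-month fallback can be worked through in plain Scala. This is a sketch built from the description above, not Spark's code: `MonthsBetweenSketch` is a hypothetical name, time zones are ignored, and the 8-digit rounding is always applied.

```scala
import java.time.LocalDateTime

object MonthsBetweenSketch {
  private def isLastDayOfMonth(t: LocalDateTime): Boolean =
    t.getDayOfMonth == t.toLocalDate.lengthOfMonth

  def monthsBetween(end: LocalDateTime, start: LocalDateTime): Double = {
    val wholeMonths =
      (end.getYear - start.getYear) * 12 + (end.getMonthValue - start.getMonthValue)
    val sameDay = end.getDayOfMonth == start.getDayOfMonth
    if (sameDay || (isLastDayOfMonth(end) && isLastDayOfMonth(start))) wholeMonths.toDouble
    else {
      val secondsPerDay = 86400.0
      val dayPart = (end.getDayOfMonth - start.getDayOfMonth) * secondsPerDay
      val timePart = (end.toLocalTime.toSecondOfDay - start.toLocalTime.toSecondOfDay).toDouble
      // Fractional part assumes every month has 31 days.
      val raw = wholeMonths + (dayPart + timePart) / (31 * secondsPerDay)
      BigDecimal(raw).setScale(8, BigDecimal.RoundingMode.HALF_UP).toDouble
    }
  }
}
```

For the second example, the inputs fall in the same month and differ by -9 days, so the result is -9 / 31, which rounds to -0.29032258.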

def
nanvl(col1: Column, col2: Column): Column
Returns col1 if it is not NaN, or col2 if col1 is NaN.
Both inputs should be floating point columns (DoubleType or FloatType).
 Since
1.5.0

final
def
ne(arg0: AnyRef): Boolean
 Definition Classes
 AnyRef

def
negate(e: Column): Column
Unary minus, i.e. negate the expression.
// Select the amount column and negate all values.
// Scala:
df.select( -df("amount") )
// Java:
df.select( negate(df.col("amount")) );
 Since
1.3.0

def
next_day(date: Column, dayOfWeek: Column): Column
Returns the first date which is later than the value of the date column that is on the specified day of the week.
For example, next_day('2015-07-27', "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27.
 date
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 dayOfWeek
A column of the day of week. Case insensitive, and accepts: "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"
 returns
A date, or null if date was a string that could not be cast to a date or if dayOfWeek was an invalid value
 Since
3.2.0

def
next_day(date: Column, dayOfWeek: String): Column
Returns the first date which is later than the value of the date column that is on the specified day of the week.
For example, next_day('2015-07-27', "Sunday") returns 2015-08-02 because that is the first Sunday after 2015-07-27.
 date
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 dayOfWeek
Case insensitive, and accepts: "Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"
 returns
A date, or null if date was a string that could not be cast to a date or if dayOfWeek was an invalid value
 Since
1.5.0
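
The "strictly later" rule in the example above can be reproduced with `java.time`. This is a sketch of the documented result for valid inputs; `NextDaySketch` is a hypothetical name, and it takes a `DayOfWeek` rather than the string abbreviations Spark accepts.

```scala
import java.time.{DayOfWeek, LocalDate}
import java.time.temporal.TemporalAdjusters

object NextDaySketch {
  // TemporalAdjusters.next always moves forward, even if `date`
  // already falls on the requested day of the week.
  def nextDay(date: LocalDate, dayOfWeek: DayOfWeek): LocalDate =
    date.`with`(TemporalAdjusters.next(dayOfWeek))
}
```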

def
not(e: Column): Column
Inversion of boolean expression, i.e. NOT.
// Scala: select rows that are not active (isActive === false)
df.filter( !df("isActive") )
// Java:
df.filter( not(df.col("isActive")) );
 Since
1.3.0

final
def
notify(): Unit
 Definition Classes
 AnyRef
 Annotations
 @native()

final
def
notifyAll(): Unit
 Definition Classes
 AnyRef
 Annotations
 @native()

def
nth_value(e: Column, offset: Int): Column
Window function: returns the value that is the offset-th row of the window frame (counting from 1), and null if the size of the window frame is less than offset rows.
This is equivalent to the nth_value function in SQL.
 Since
3.1.0

def
nth_value(e: Column, offset: Int, ignoreNulls: Boolean): Column
Window function: returns the value that is the offset-th row of the window frame (counting from 1), and null if the size of the window frame is less than offset rows.
It will return the offset-th non-null value it sees when ignoreNulls is set to true. If all values are null, then null is returned.
This is equivalent to the nth_value function in SQL.
 Since
3.1.0

def
ntile(n: Int): Column
Window function: returns the ntile group id (from 1 to n inclusive) in an ordered window partition. For example, if n is 4, the first quarter of the rows will get value 1, the second quarter will get 2, the third quarter will get 3, and the last quarter will get 4.
This is equivalent to the NTILE function in SQL.
 Since
1.4.0
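
The quartile example above can be sketched as a bucket assignment over an ordered partition. `NtileSketch` is a hypothetical name, and giving the leftover rows to the earliest buckets when the row count is not divisible by n follows the usual SQL NTILE convention, which is an assumption here.

```scala
object NtileSketch {
  // Returns the bucket id (1..n) for each of `rowCount` ordered rows.
  def ntile(rowCount: Int, n: Int): Seq[Int] = {
    require(n > 0, "n must be positive")
    val base = rowCount / n  // minimum rows per bucket
    val extra = rowCount % n // the first `extra` buckets get one more row
    (1 to n).flatMap { bucket =>
      val size = if (bucket <= extra) base + 1 else base
      Seq.fill(size)(bucket)
    }
  }
}
```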

def
octet_length(e: Column): Column
Calculates the byte length for the specified string column.
 Since
3.3.0

def
overlay(src: Column, replace: Column, pos: Column): Column
Overlay the specified portion of src with replace, starting from byte position pos of src.
 Since
3.0.0

def
overlay(src: Column, replace: Column, pos: Column, len: Column): Column
Overlay the specified portion of src with replace, starting from byte position pos of src and proceeding for len bytes.
 Since
3.0.0

def
percent_rank(): Column
Window function: returns the relative rank (i.e. percentile) of rows within a window partition.
This is computed by:
(rank of row in its partition - 1) / (number of rows in the partition - 1)
This is equivalent to the PERCENT_RANK function in SQL.
 Since
1.6.0
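
The formula above can be applied by hand to a small partition. In this sketch, `PercentRankSketch` is a hypothetical name, the input is one partition's ordering values already sorted ascending, and returning 0.0 for a single-row partition is an assumption made to avoid dividing by zero.

```scala
object PercentRankSketch {
  def percentRank(sorted: Seq[Int]): Seq[Double] = {
    val n = sorted.length
    sorted.map { v =>
      // rank = 1 + number of rows strictly before the first row with this value
      val rank = sorted.indexOf(v) + 1
      if (n <= 1) 0.0 else (rank - 1).toDouble / (n - 1)
    }
  }
}
```

For the values 10, 20, 20, 30 the ranks are 1, 2, 2, 4, so the relative ranks are 0, 1/3, 1/3 and 1.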

def
percentile_approx(e: Column, percentage: Column, accuracy: Column): Column
Aggregate function: returns the approximate percentile of the numeric column col, which is the smallest value in the ordered col values (sorted from least to greatest) such that no more than percentage of col values is less than the value or equal to that value.
If percentage is an array, each value must be between 0.0 and 1.0. If it is a single floating point value, it must be between 0.0 and 1.0.
The accuracy parameter is a positive numeric literal which controls approximation accuracy at the cost of memory. A higher value of accuracy yields better accuracy; 1.0/accuracy is the relative error of the approximation.
 Since
3.1.0

def
pmod(dividend: Column, divisor: Column): Column
Returns the positive value of dividend mod divisor.
 Since
1.5.0
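
The difference between pmod and the ordinary remainder shows up for negative dividends. The sketch below uses a hypothetical `PmodSketch` object for integers; the adjust-when-negative formula is an assumption about how the positive remainder is obtained, not a quote of Spark's implementation.

```scala
object PmodSketch {
  def pmod(dividend: Int, divisor: Int): Int = {
    val r = dividend % divisor
    // Scala's % keeps the sign of the dividend, so shift negative remainders up.
    if (r < 0) (r + divisor) % divisor else r
  }
}
```

For example, `-7 % 3` is `-1` in Scala, while `PmodSketch.pmod(-7, 3)` is `2`.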

def
posexplode(e: Column): Column
Creates a new row for each element with position in the given array or map column. Uses the default column name pos for position, and col for elements in the array and key and value for elements in the map unless specified otherwise.
 Since
2.1.0

def
posexplode_outer(e: Column): Column
Creates a new row for each element with position in the given array or map column. Uses the default column name pos for position, and col for elements in the array and key and value for elements in the map unless specified otherwise. Unlike posexplode, if the array/map is null or empty then the row (null, null) is produced.
 Since
2.2.0

def
pow(l: Double, rightName: String): Column
Returns the value of the first argument raised to the power of the second argument.
 Since
1.4.0

def
pow(l: Double, r: Column): Column
Returns the value of the first argument raised to the power of the second argument.
 Since
1.4.0

def
pow(leftName: String, r: Double): Column
Returns the value of the first argument raised to the power of the second argument.
 Since
1.4.0

def
pow(l: Column, r: Double): Column
Returns the value of the first argument raised to the power of the second argument.
 Since
1.4.0

def
pow(leftName: String, rightName: String): Column
Returns the value of the first argument raised to the power of the second argument.
 Since
1.4.0

def
pow(leftName: String, r: Column): Column
Returns the value of the first argument raised to the power of the second argument.
 Since
1.4.0

def
pow(l: Column, rightName: String): Column
Returns the value of the first argument raised to the power of the second argument.
 Since
1.4.0

def
pow(l: Column, r: Column): Column
Returns the value of the first argument raised to the power of the second argument.
 Since
1.4.0

def
product(e: Column): Column
Aggregate function: returns the product of all numerical elements in a group.
 Since
3.2.0

def
quarter(e: Column): Column
Extracts the quarter as an integer from a given date/timestamp/string.
 returns
An integer, or null if the input was a string that could not be cast to a date
 Since
1.5.0

def
radians(columnName: String): Column
Converts an angle measured in degrees to an approximately equivalent angle measured in radians.
 columnName
angle in degrees
 returns
angle in radians, as if computed by
java.lang.Math.toRadians
 Since
2.1.0

def
radians(e: Column): Column
Converts an angle measured in degrees to an approximately equivalent angle measured in radians.
 e
angle in degrees
 returns
angle in radians, as if computed by
java.lang.Math.toRadians
 Since
2.1.0

def
raise_error(c: Column): Column
Throws an exception with the provided error message.
 Since
3.1.0

def
rand(): Column
Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
 Since
1.4.0
 Note
The function is non-deterministic in the general case.

def
rand(seed: Long): Column
Generate a random column with independent and identically distributed (i.i.d.) samples uniformly distributed in [0.0, 1.0).
 Since
1.4.0
 Note
The function is non-deterministic in the general case.

def
randn(): Column
Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
 Since
1.4.0
 Note
The function is non-deterministic in the general case.

def
randn(seed: Long): Column
Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.
 Since
1.4.0
 Note
The function is non-deterministic in the general case.

def
rank(): Column
Window function: returns the rank of rows within a window partition.
The difference between rank and dense_rank is that dense_rank leaves no gaps in the ranking sequence when there are ties. That is, if you were ranking a competition using dense_rank and had three people tie for second place, you would say that all three were in second place and that the next person came in third. rank, by contrast, leaves a gap after the ties, so the person who came in after the three tied for second would register as coming in fifth.
This is equivalent to the RANK function in SQL.
 Since
1.4.0
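
The competition example above can be modelled in plain Scala. This is only a sketch of the two numbering schemes: `RankSketch` is a hypothetical name, and the input is one partition's ordering values already in window order.

```scala
object RankSketch {
  // rank: 1 + number of rows before the first row with the same value (leaves gaps).
  def rank(sorted: Seq[Int]): Seq[Int] =
    sorted.map(v => sorted.indexOf(v) + 1)

  // dense_rank: position of the value among the distinct values (no gaps).
  def denseRank(sorted: Seq[Int]): Seq[Int] = {
    val distinctValues = sorted.distinct
    sorted.map(v => distinctValues.indexOf(v) + 1)
  }
}
```

With scores 100, 90, 90, 90, 80 in descending order, `rank` gives 1, 2, 2, 2, 5 and `denseRank` gives 1, 2, 2, 2, 3.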

def
regexp_extract(e: Column, exp: String, groupIdx: Int): Column
Extract a specific group matched by a Java regex, from the specified string column. If the regex did not match, or the specified group did not match, an empty string is returned. If the specified group index exceeds the group count of the regex, an IllegalArgumentException will be thrown.
 Since
1.5.0

def
regexp_replace(e: Column, pattern: Column, replacement: Column): Column
Replace all substrings of the specified string value that match pattern with replacement.
 Since
2.1.0

def
regexp_replace(e: Column, pattern: String, replacement: String): Column
Replace all substrings of the specified string value that match pattern with replacement.
 Since
1.5.0

def
repeat(str: Column, n: Int): Column
Repeats a string column n times, and returns it as a new string column.
 Since
1.5.0

def
reverse(e: Column): Column
Returns a reversed string or an array with reverse order of elements.
 Since
1.5.0

def
rint(columnName: String): Column
Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
 Since
1.4.0

def
rint(e: Column): Column
Returns the double value that is closest in value to the argument and is equal to a mathematical integer.
 Since
1.4.0

def
round(e: Column, scale: Int): Column
Round the value of e to scale decimal places with HALF_UP round mode if scale is greater than or equal to 0 or at integral part when scale is less than 0.
 Since
1.5.0

def
round(e: Column): Column
Returns the value of the column e rounded to 0 decimal places with HALF_UP round mode.
 Since
1.5.0
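
HALF_UP rounding with positive, zero and negative scales can be illustrated with Scala's `BigDecimal`. This is a sketch of the rounding mode itself; `RoundSketch` is a hypothetical name, and exact decimal inputs are used to sidestep binary floating-point representation issues.

```scala
import scala.math.BigDecimal.RoundingMode

object RoundSketch {
  // A negative scale rounds at the integral part (e.g. -2 rounds to hundreds).
  def roundHalfUp(value: BigDecimal, scale: Int): BigDecimal =
    value.setScale(scale, RoundingMode.HALF_UP)
}
```

HALF_UP rounds ties away from zero, so 2.5 becomes 3 and -2.5 becomes -3, while 1250 at scale -2 becomes 1300.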

def
row_number(): Column
Window function: returns a sequential number starting at 1 within a window partition.
 Since
1.6.0

def
rpad(str: Column, len: Int, pad: Array[Byte]): Column
Right-pad the binary column with pad to a byte length of len. If the binary column is longer than len, the return value is shortened to len bytes.
 Since
3.3.0

def
rpad(str: Column, len: Int, pad: String): Column
Right-pad the string column with pad to a length of len. If the string column is longer than len, the return value is shortened to len characters.
 Since
1.5.0

def
rtrim(e: Column, trimString: String): Column
Trim the specified character string from right end for the specified string column.
 Since
2.3.0

def
rtrim(e: Column): Column
Trim the spaces from right end for the specified string value.
 Since
1.5.0

def
schema_of_csv(csv: Column, options: Map[String, String]): Column
Parses a CSV string and infers its schema in DDL format using options.
 csv
a foldable string column containing a CSV string.
 options
options to control how the CSV is parsed. Accepts the same options as the CSV data source. See Data Source Option in the version you use.
 returns
a column with string literal containing schema in DDL format.
 Since
3.0.0

def
schema_of_csv(csv: Column): Column
Parses a CSV string and infers its schema in DDL format.
 csv
a foldable string column containing a CSV string.
 Since
3.0.0

def
schema_of_csv(csv: String): Column
Parses a CSV string and infers its schema in DDL format.
 csv
a CSV string.
 Since
3.0.0

def
schema_of_json(json: Column, options: Map[String, String]): Column
Parses a JSON string and infers its schema in DDL format using options.
 json
a foldable string column containing JSON data.
 options
options to control how the JSON is parsed. Accepts the same options as the JSON data source. See Data Source Option in the version you use.
 returns
a column with string literal containing schema in DDL format.
 Since
3.0.0

def
schema_of_json(json: Column): Column
Parses a JSON string and infers its schema in DDL format.
 json
a foldable string column containing a JSON string.
 Since
2.4.0

def
schema_of_json(json: String): Column
Parses a JSON string and infers its schema in DDL format.
 json
a JSON string.
 Since
2.4.0
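
The schema inference functions above can be tried on literal samples. The following is a minimal sketch, assuming a local SparkSession and Spark 3.0 or later; the sample strings and column aliases are made up for illustration, and the exact text of the returned DDL string varies between Spark versions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("schema-of-example").getOrCreate()

// Infer DDL schemas from one sample CSV line and one sample JSON document.
val schemas = spark.range(1).select(
  schema_of_csv("1,abc").as("csv_schema"),
  schema_of_json("""{"a": 1, "b": "x"}""").as("json_schema")
).head()

val csvSchema = schemas.getString(0)   // a struct with columns _c0 and _c1
val jsonSchema = schemas.getString(1)  // a struct with fields a and b
println(csvSchema)
println(jsonSchema)
```

The inferred string is mainly useful as input to parsers that accept a schema in DDL form.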

def
sec(e: Column): Column
 e
angle in radians
 returns
secant of the angle
 Since
3.3.0

def
second(e: Column): Column
Extracts the seconds as an integer from a given date/timestamp/string.
 returns
An integer, or null if the input was a string that could not be cast to a timestamp
 Since
1.5.0

def
sentences(string: Column): Column
Splits a string into arrays of sentences, where each sentence is an array of words. The default locale is used.
 Since
3.2.0

def
sentences(string: Column, language: Column, country: Column): Column
Splits a string into arrays of sentences, where each sentence is an array of words.
 Since
3.2.0

def
sequence(start: Column, stop: Column): Column
Generate a sequence of integers from start to stop, incrementing by 1 if start is less than or equal to stop, otherwise by -1.
 Since
2.4.0

def
sequence(start: Column, stop: Column, step: Column): Column
Generate a sequence of integers from start to stop, incrementing by step.
 Since
2.4.0
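
Both sequence overloads can be illustrated with literal bounds. This is a minimal sketch, assuming a local SparkSession; the aliases are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("sequence-example").getOrCreate()

val seqRow = spark.range(1).select(
  sequence(lit(1), lit(5)).as("up"),               // implicit step of 1
  sequence(lit(5), lit(1)).as("down"),             // start > stop, so the implicit step is -1
  sequence(lit(0), lit(10), lit(5)).as("stepped")  // explicit step
).head()

val up = seqRow.getSeq[Int](0)
val down = seqRow.getSeq[Int](1)
val stepped = seqRow.getSeq[Int](2)
```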

def
session_window(timeColumn: Column, gapDuration: Column): Column
Generates session window given a timestamp specifying column.
Session window is one of dynamic windows, which means the length of window is varying according to the given inputs. For static gap duration, the length of session window is defined as "the timestamp of latest input of the session + gap duration", so when the new inputs are bound to the current session window, the end time of session window can be expanded according to the new inputs.
Besides a static gap duration value, users can also provide an expression to specify gap duration dynamically based on the input row. With dynamic gap duration, the closing of a session window does not depend on the latest input anymore. A session window's range is the union of all events' ranges which are determined by event start time and evaluated gap duration during the query execution. Note that the rows with negative or zero gap duration will be filtered out from the aggregation.
Windows can support microsecond precision. A gapDuration in the order of months is not supported.
For a streaming query, you may use the function current_timestamp to generate windows on processing time.
 timeColumn
The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType.
 gapDuration
A column specifying the timeout of the session. It could be a static value, e.g. 10 minutes, 1 second, or an expression/UDF that specifies gap duration dynamically based on the input row.
 Since
3.2.0

def
session_window(timeColumn: Column, gapDuration: String): Column
Generates session window given a timestamp specifying column.
Session window is one of dynamic windows, which means the length of window is varying according to the given inputs. The length of session window is defined as "the timestamp of latest input of the session + gap duration", so when the new inputs are bound to the current session window, the end time of session window can be expanded according to the new inputs.
Windows can support microsecond precision. A gapDuration in the order of months is not supported.
For a streaming query, you may use the function current_timestamp to generate windows on processing time.
 timeColumn
The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType.
 gapDuration
A string specifying the timeout of the session, e.g. 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers.
 Since
3.2.0
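
The static-gap behaviour described above can be sketched on a small batch DataFrame. This is a minimal sketch, assuming a local SparkSession and Spark 3.2 or later; the event data and the column names user and ts are invented for illustration.

```scala
import java.sql.Timestamp
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("session-window-example").getOrCreate()
import spark.implicits._

// Two events 3 minutes apart, then a third one 17 minutes later.
val events = Seq(
  ("u1", Timestamp.valueOf("2024-01-01 10:00:00")),
  ("u1", Timestamp.valueOf("2024-01-01 10:03:00")),
  ("u1", Timestamp.valueOf("2024-01-01 10:20:00"))
).toDF("user", "ts")

// With a 5 minute gap, the first two events share one session and the third starts a new one.
val sessions = events
  .groupBy($"user", session_window($"ts", "5 minutes"))
  .count()

val sessionCounts = sessions.collect().map(_.getLong(2)).sorted
```

The first session is extended by the second event because it arrives before the gap expires; the third event arrives after the gap and therefore opens its own window.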

def
sha1(e: Column): Column
Calculates the SHA1 digest of a binary column and returns the value as a 40 character hex string.
 Since
1.5.0

def
sha2(e: Column, numBits: Int): Column
Calculates the SHA2 family of hash functions of a binary column and returns the value as a hex string.
 e
column to compute SHA2 on.
 numBits
one of 224, 256, 384, or 512.
 Since
1.5.0
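
As a quick illustration of sha1 and sha2, the sketch below hashes a single string after casting it to binary. It assumes a local SparkSession; the column name s is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("sha-example").getOrCreate()
import spark.implicits._

val hashRow = Seq("abc").toDF("s")
  .select(
    sha1($"s".cast("binary")).as("sha1_hex"),
    sha2($"s".cast("binary"), 256).as("sha256_hex")  // numBits selects the SHA-2 variant
  )
  .head()

val sha1Hex = hashRow.getString(0)    // 40 hex characters
val sha256Hex = hashRow.getString(1)  // 64 hex characters
```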

def
shiftleft(e: Column, numBits: Int): Column
Shift the given value numBits left. If the given value is a long value, this function will return a long value else it will return an integer value.
 Since
3.2.0

def
shiftright(e: Column, numBits: Int): Column
(Signed) shift the given value numBits right. If the given value is a long value, it will return a long value else it will return an integer value.
 Since
3.2.0

def
shiftrightunsigned(e: Column, numBits: Int): Column
Unsigned shift the given value numBits right. If the given value is a long value, it will return a long value else it will return an integer value.
 Since
3.2.0
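
The difference between the signed and unsigned right shift shows up on negative inputs. This is a minimal sketch on integer literals, assuming a local SparkSession and Spark 3.2 or later (where these lowercase names exist).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("shift-example").getOrCreate()

val shiftRow = spark.range(1).select(
  shiftleft(lit(1), 4),             // same as 1 << 4 on an Int
  shiftright(lit(-16), 2),          // sign-preserving, same as -16 >> 2
  shiftrightunsigned(lit(-16), 2)   // zero-filling, same as -16 >>> 2
).head()

val shiftedLeft = shiftRow.getInt(0)
val shiftedRight = shiftRow.getInt(1)
val shiftedRightUnsigned = shiftRow.getInt(2)
```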

def
shuffle(e: Column): Column
Returns a random permutation of the given array.
 Since
2.4.0
 Note
The function is nondeterministic.

def
signum(columnName: String): Column
Computes the signum of the given column.
 Since
1.4.0

def
signum(e: Column): Column
Computes the signum of the given value.
 Since
1.4.0

def
sin(columnName: String): Column
 columnName
angle in radians
 returns
sine of the angle, as if computed by
java.lang.Math.sin
 Since
1.4.0

def
sin(e: Column): Column
 e
angle in radians
 returns
sine of the angle, as if computed by
java.lang.Math.sin
 Since
1.4.0

def
sinh(columnName: String): Column
 columnName
hyperbolic angle
 returns
hyperbolic sine of the given value, as if computed by
java.lang.Math.sinh
 Since
1.4.0

def
sinh(e: Column): Column
 e
hyperbolic angle
 returns
hyperbolic sine of the given value, as if computed by
java.lang.Math.sinh
 Since
1.4.0

def
size(e: Column): Column
Returns length of array or map.
The function returns null for null input if spark.sql.legacy.sizeOfNull is set to false or spark.sql.ansi.enabled is set to true. Otherwise, the function returns -1 for null input. With the default settings, the function returns -1 for null input.
 Since
1.5.0
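
Because the result for null input depends on configuration, it can help to pin the setting explicitly. The sketch below assumes a local SparkSession and sets spark.sql.legacy.sizeOfNull to false before building the query; the column names id and xs are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("size-example").getOrCreate()
import spark.implicits._

// Ask for null (rather than -1) when the array itself is null.
spark.conf.set("spark.sql.legacy.sizeOfNull", "false")

val arrays = Seq(
  (1, Option(Seq(1, 2, 3))),
  (2, Option.empty[Seq[Int]])
).toDF("id", "xs")

val sizeRows = arrays.select($"id", size($"xs").as("n")).orderBy($"id").collect()
val nonNullSize = sizeRows(0).getInt(1)
val nullInputIsNull = sizeRows(1).isNullAt(1)
```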

def
skewness(columnName: String): Column
Aggregate function: returns the skewness of the values in a group.
 Since
1.6.0

def
skewness(e: Column): Column
Aggregate function: returns the skewness of the values in a group.
 Since
1.6.0

def
slice(x: Column, start: Column, length: Column): Column
Returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.
 x
the array column to be sliced
 start
the starting index
 length
the length of the slice
 Since
3.1.0

def
slice(x: Column, start: Int, length: Int): Column
Returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.
 x
the array column to be sliced
 start
the starting index
 length
the length of the slice
 Since
2.4.0
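
A short sketch of slice with a positive and a negative start, assuming a local SparkSession; the column name xs is hypothetical. Array positions for slice are counted from 1.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("slice-example").getOrCreate()
import spark.implicits._

val sliceRow = Seq(Seq(1, 2, 3, 4, 5)).toDF("xs")
  .select(
    slice($"xs", 2, 3),   // three elements starting at the second position
    slice($"xs", -2, 2)   // two elements starting two positions from the end
  )
  .head()

val fromFront = sliceRow.getSeq[Int](0)
val fromBack = sliceRow.getSeq[Int](1)
```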

def
sort_array(e: Column, asc: Boolean): Column
Sorts the input array for the given column in ascending or descending order, according to the natural ordering of the array elements. NaN is greater than any non-NaN elements for double/float type. Null elements will be placed at the beginning of the returned array in ascending order or at the end of the returned array in descending order.
 Since
1.5.0

def
sort_array(e: Column): Column
Sorts the input array for the given column in ascending order, according to the natural ordering of the array elements. Null elements will be placed at the beginning of the returned array.
 Since
1.5.0
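
The null placement rules for sort_array can be seen on a small literal array. This is a minimal sketch, assuming a local SparkSession; the array is built from literals purely for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("sort-array-example").getOrCreate()

// An int array containing a null element.
val arr = array(lit(3), lit(null).cast("int"), lit(1))

val sortRow = spark.range(1).select(
  sort_array(arr).as("ascending"),
  sort_array(arr, asc = false).as("descending")
).head()

val ascending = sortRow.getSeq[Any](0)    // null comes first
val descending = sortRow.getSeq[Any](1)   // null comes last
```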

def
soundex(e: Column): Column
Returns the soundex code for the specified expression.
 Since
1.5.0

def
spark_partition_id(): Column
Partition ID.
 Since
1.6.0
 Note
This is nondeterministic because it depends on data partitioning and task scheduling.

def
split(str: Column, pattern: String, limit: Int): Column
Splits str around matches of the given pattern.
 str
a string expression to split
 pattern
a string representing a regular expression. The regex string should be a Java regular expression.
 limit
an integer expression which controls the number of times the regex is applied.
 limit greater than 0: The resulting array's length will not be more than limit, and the resulting array's last entry will contain all input beyond the last matched regex.
 limit less than or equal to 0: regex will be applied as many times as possible, and the resulting array can be of any size.
 Since
3.0.0

def
split(str: Column, pattern: String): Column
Splits str around matches of the given pattern.
 str
a string expression to split
 pattern
a string representing a regular expression. The regex string should be a Java regular expression.
 Since
1.5.0
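
The effect of the limit argument is easiest to see side by side with the two-argument form. A minimal sketch, assuming a local SparkSession and Spark 3.0 or later; the column name s is hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("split-example").getOrCreate()
import spark.implicits._

val splitRow = Seq("a,b,c,d").toDF("s")
  .select(
    split($"s", ","),      // no limit
    split($"s", ",", 2),   // at most 2 entries; the last one keeps the remainder
    split($"s", ",", -1)   // non-positive limit: split as often as possible
  )
  .head()

val unlimited = splitRow.getSeq[String](0)
val limited = splitRow.getSeq[String](1)
val negativeLimit = splitRow.getSeq[String](2)
```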

def
sqrt(colName: String): Column
Computes the square root of the specified float value.
 Since
1.5.0

def
sqrt(e: Column): Column
Computes the square root of the specified float value.
 Since
1.3.0

def
stddev(columnName: String): Column
Aggregate function: alias for stddev_samp.
 Since
1.6.0

def
stddev(e: Column): Column
Aggregate function: alias for stddev_samp.
 Since
1.6.0

def
stddev_pop(columnName: String): Column
Aggregate function: returns the population standard deviation of the expression in a group.
 Since
1.6.0

def
stddev_pop(e: Column): Column
Aggregate function: returns the population standard deviation of the expression in a group.
 Since
1.6.0

def
stddev_samp(columnName: String): Column
Aggregate function: returns the sample standard deviation of the expression in a group.
 Since
1.6.0

def
stddev_samp(e: Column): Column
Aggregate function: returns the sample standard deviation of the expression in a group.
 Since
1.6.0

def
struct(colName: String, colNames: String*): Column
Creates a new struct column that composes multiple input columns.
 Annotations
 @varargs()
 Since
1.4.0

def
struct(cols: Column*): Column
Creates a new struct column. If the input column is a column in a DataFrame, or a derived column expression that is named (i.e. aliased), its name would be retained as the StructField's name; otherwise, the newly generated StructField's name would be auto-generated as col with a suffix index + 1, i.e. col1, col2, col3, ...
 Annotations
 @varargs()
 Since
1.4.0
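
The field naming rule for struct(cols: Column*) can be checked by inspecting the resulting schema. A minimal sketch, assuming a local SparkSession; the column names a, b, twice and s are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

val spark = SparkSession.builder().master("local[1]").appName("struct-example").getOrCreate()
import spark.implicits._

val pairs = Seq((1, 2)).toDF("a", "b")

val structured = pairs.select(
  struct(
    $"a",                   // a plain column keeps its name
    $"b" + 1,               // an unnamed expression in position 2 gets an auto-generated name
    ($"a" * 2).as("twice")  // an aliased expression keeps its alias
  ).as("s")
)

val fieldNames = structured.schema("s").dataType.asInstanceOf[StructType].fieldNames.toSeq
```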

def
substring(str: Column, pos: Int, len: Int): Column
Substring starts at pos and is of length len when str is String type, or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type.
 Since
1.5.0
 Note
The position is not zero based, but 1 based index.

def
substring_index(str: Column, delim: String, count: Int): Column
Returns the substring from string str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. substring_index performs a case-sensitive match when searching for delim.
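
The 1-based position of substring and the sign of count in substring_index can be illustrated together. A minimal sketch, assuming a local SparkSession; the sample strings and column names are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("substring-example").getOrCreate()
import spark.implicits._

val subRow = Seq(("Spark SQL", "www.apache.org")).toDF("title", "host")
  .select(
    substring($"title", 1, 5),           // position 1 is the first character
    substring_index($"host", ".", 2),    // everything left of the 2nd "." from the left
    substring_index($"host", ".", -1)    // everything right of the 1st "." from the right
  )
  .head()

val prefix = subRow.getString(0)
val leftPart = subRow.getString(1)
val rightPart = subRow.getString(2)
```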

def
sum(columnName: String): Column
Aggregate function: returns the sum of all values in the given column.
 Since
1.3.0

def
sum(e: Column): Column
Aggregate function: returns the sum of all values in the expression.
 Since
1.3.0

def
sum_distinct(e: Column): Column
Aggregate function: returns the sum of distinct values in the expression.
 Since
3.2.0

final
def
synchronized[T0](arg0: ⇒ T0): T0
 Definition Classes
 AnyRef

def
tan(columnName: String): Column
 columnName
angle in radians
 returns
tangent of the given value, as if computed by
java.lang.Math.tan
 Since
1.4.0

def
tan(e: Column): Column
 e
angle in radians
 returns
tangent of the given value, as if computed by
java.lang.Math.tan
 Since
1.4.0

def
tanh(columnName: String): Column
 columnName
hyperbolic angle
 returns
hyperbolic tangent of the given value, as if computed by
java.lang.Math.tanh
 Since
1.4.0

def
tanh(e: Column): Column
 e
hyperbolic angle
 returns
hyperbolic tangent of the given value, as if computed by
java.lang.Math.tanh
 Since
1.4.0

def
timestamp_seconds(e: Column): Column
Converts the number of seconds from the Unix epoch (1970-01-01T00:00:00Z) to a timestamp.
 Since
3.1.0

def
toString(): String
 Definition Classes
 AnyRef → Any

def
to_csv(e: Column): Column
Converts a column containing a StructType into a CSV string with the specified schema. Throws an exception, in the case of an unsupported type.
 e
a column containing a struct.
 Since
3.0.0

def
to_csv(e: Column, options: Map[String, String]): Column
(Java-specific) Converts a column containing a StructType into a CSV string with the specified schema. Throws an exception, in the case of an unsupported type.
 e
a column containing a struct.
 options
options to control how the struct column is converted into a CSV string. It accepts the same options as the CSV data source. See Data Source Option in the version you use.
 Since
3.0.0

def
to_date(e: Column, fmt: String): Column
Converts the column into a DateType with a specified format.
See Datetime Patterns for valid date and time format patterns.
 e
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 fmt
A date time pattern detailing the format of e when e is a string
 returns
A date, or null if e was a string that could not be cast to a date or fmt was an invalid format
 Since
2.2.0

def
to_date(e: Column): Column
Converts the column into DateType by casting rules to DateType.
 Since
1.5.0

def
to_json(e: Column): Column
Converts a column containing a StructType, ArrayType or a MapType into a JSON string with the specified schema. Throws an exception, in the case of an unsupported type.
 e
a column containing a struct, an array or a map.
 Since
2.1.0

def
to_json(e: Column, options: Map[String, String]): Column
(Java-specific) Converts a column containing a StructType, ArrayType or a MapType into a JSON string with the specified schema. Throws an exception, in the case of an unsupported type.
 e
a column containing a struct, an array or a map.
 options
options to control how the struct column is converted into a json string. Accepts the same options as the json data source. See Data Source Option in the version you use. Additionally the function supports the pretty option which enables pretty JSON generation.
 Since
2.1.0

def
to_json(e: Column, options: Map[String, String]): Column
(Scala-specific) Converts a column containing a StructType, ArrayType or a MapType into a JSON string with the specified schema. Throws an exception, in the case of an unsupported type.
 e
a column containing a struct, an array or a map.
 options
options to control how the struct column is converted into a json string. Accepts the same options as the json data source. See Data Source Option in the version you use. Additionally the function supports the pretty option which enables pretty JSON generation.
 Since
2.1.0

def
to_timestamp(s: Column, fmt: String): Column
Converts time string with the given pattern to timestamp.
See Datetime Patterns for valid date and time format patterns.
 s
A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 fmt
A date time pattern detailing the format of s when s is a string
 returns
A timestamp, or null if s was a string that could not be cast to a timestamp or fmt was an invalid format
 Since
2.2.0

def
to_timestamp(s: Column): Column
Converts to a timestamp by casting rules to TimestampType.
 s
A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 returns
A timestamp, or null if the input was a string that could not be cast to a timestamp
 Since
2.2.0
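
The difference between the casting overloads and the pattern-based overloads of to_date and to_timestamp can be sketched as follows. This assumes a local SparkSession; the input strings and column names are invented, and only well-formed inputs are used because the handling of unparseable strings depends on the session's configuration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

val spark = SparkSession.builder().master("local[1]").appName("to-date-example").getOrCreate()
import spark.implicits._

val raw = Seq(("2017-07-14", "14/07/2017", "2017-07-14 02:40:00")).toDF("iso", "custom", "ts")

val parsed = raw.select(
  to_date($"iso").as("d1"),                     // casting rules: ISO-style strings work as-is
  to_date($"custom", "dd/MM/yyyy").as("d2"),    // explicit pattern for a non-ISO layout
  to_timestamp($"ts").as("t1")
)

val parsedRow = parsed.head()
val d1 = parsedRow.get(0).toString
val d2 = parsedRow.get(1).toString
val t1Type = parsed.schema("t1").dataType
```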

def
to_utc_timestamp(ts: Column, tz: Column): Column
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'.
 Since
2.4.0

def
to_utc_timestamp(ts: Column, tz: String): Column
Given a timestamp like '2017-07-14 02:40:00.0', interprets it as a time in the given time zone, and renders that time as a timestamp in UTC. For example, 'GMT+1' would yield '2017-07-14 01:40:00.0'.
 ts
A date, timestamp or string. If a string, the data must be in a format that can be cast to a timestamp, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 tz
A string detailing the time zone ID that the input should be adjusted to. It should be in the format of either region-based zone IDs or zone offsets. Region IDs must have the form 'area/city', such as 'America/Los_Angeles'. Zone offsets must be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'. Other short names are not recommended to use because they can be ambiguous.
 returns
A timestamp, or null if ts was a string that could not be cast to a timestamp or tz was an invalid value
 Since
1.5.0
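
The example from the description can be reproduced with a zone offset and with a region-based zone ID. This is a minimal sketch, assuming a local SparkSession; results are formatted back to strings inside Spark so that they do not depend on how the driver renders timestamps.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("to-utc-example").getOrCreate()

val wallClock = lit("2017-07-14 02:40:00").cast("timestamp")
val pattern = "yyyy-MM-dd HH:mm:ss"

val utcRow = spark.range(1).select(
  date_format(to_utc_timestamp(wallClock, "GMT+1"), pattern),
  date_format(to_utc_timestamp(wallClock, "America/Los_Angeles"), pattern)  // UTC-7 in July
).head()

val fromGmtPlusOne = utcRow.getString(0)
val fromLosAngeles = utcRow.getString(1)
```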

def
transform(column: Column, f: (Column, Column) ⇒ Column): Column
Returns an array of elements after applying a transformation to each element in the input array.
df.select(transform(col("i"), (x, i) => x + i))
 column
the input array column
 f
(col, index) => transformed_col, the lambda function to transform the input column given the index. Indices start at 0.
 Since
3.0.0

def
transform(column: Column, f: (Column) ⇒ Column): Column
Returns an array of elements after applying a transformation to each element in the input array.
df.select(transform(col("i"), x => x + 1))
 column
the input array column
 f
col => transformed_col, the lambda function to transform the input column
 Since
3.0.0

def
transform_keys(expr: Column, f: (Column, Column) ⇒ Column): Column
Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new keys for the pairs.
df.select(transform_keys(col("i"), (k, v) => k + v))
 expr
the input map column
 f
(key, value) => new_key, the lambda function to transform the key of input map column
 Since
3.0.0

def
transform_values(expr: Column, f: (Column, Column) ⇒ Column): Column
Applies a function to every key-value pair in a map and returns a map with the results of those applications as the new values for the pairs.
df.select(transform_values(col("i"), (k, v) => k + v))
 expr
the input map column
 f
(key, value) => new_value, the lambda function to transform the value of input map column
 Since
3.0.0
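
The one-line snippets for transform, transform_keys and transform_values can be combined into one runnable sketch. It assumes a local SparkSession and Spark 3.0 or later; the DataFrame and its column names xs and m are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("transform-example").getOrCreate()
import spark.implicits._

val data = Seq((Seq(10, 20, 30), Map("a" -> 1, "b" -> 2))).toDF("xs", "m")

val hofRow = data.select(
  transform($"xs", x => x + 1),               // element only
  transform($"xs", (x, i) => x + i),          // element plus its 0-based index
  transform_keys($"m", (k, v) => upper(k)),   // new keys, values unchanged
  transform_values($"m", (k, v) => v * 10)    // new values, keys unchanged
).head()

val plusOne = hofRow.getSeq[Int](0)
val plusIndex = hofRow.getSeq[Int](1)
val upperKeys = hofRow.getMap[String, Int](2)
val timesTen = hofRow.getMap[String, Int](3)
```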

def
translate(src: Column, matchingString: String, replaceString: String): Column
Translate any character in the src by a character in replaceString. The characters in replaceString correspond to the characters in matchingString. The translation happens whenever a character in the string matches a character in matchingString.
 Since
1.5.0
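
The positional correspondence between matchingString and replaceString is shown in the sketch below. It assumes a local SparkSession; the input string and column name are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("translate-example").getOrCreate()
import spark.implicits._

// 'a' maps to 'x' and 'b' maps to 'y'; characters not listed in matchingString stay as they are.
val translated = Seq("abcabc").toDF("s")
  .select(translate($"s", "ab", "xy"))
  .head()
  .getString(0)
```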

def
trim(e: Column, trimString: String): Column
Trim the specified character from both ends for the specified string column.
 Since
2.3.0

def
trim(e: Column): Column
Trim the spaces from both ends for the specified string column.
 Since
1.5.0

def
trunc(date: Column, format: String): Column
Returns date truncated to the unit specified by the format.
For example, trunc("2018-11-19 12:01:19", "year") returns 2018-01-01
 date
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 returns
A date, or null if date was a string that could not be cast to a date or format was an invalid value
 Since
1.5.0
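
A small sketch of trunc with two format units, assuming a local SparkSession. The "year" unit comes from the example above; "month" is another unit accepted by this function.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[1]").appName("trunc-example").getOrCreate()

val day = lit("2018-11-19").cast("date")

val truncRow = spark.range(1).select(
  trunc(day, "year"),    // first day of the year
  trunc(day, "month")    // first day of the month
).head()

val yearStart = truncRow.get(0).toString
val monthStart = truncRow.get(1).toString
```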

def
typedLit[T](literal: T)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): Column
Creates a Column of literal value.
An alias of typedlit, and it is encouraged to use typedlit directly.
 Since
2.2.0

def
typedlit[T](literal: T)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[T]): Column
Creates a Column of literal value.
The passed in object is returned directly if it is already a Column. If the object is a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created to represent the literal value. The difference between this function and lit is that this function can handle parameterized scala types e.g.: List, Seq and Map.
 Since
3.2.0
 Note
typedlit will call expensive Scala reflection APIs. lit is preferred if parameterized Scala types are not used.
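
The parameterized types mentioned above can be sketched with a Seq and a Map literal. This assumes a local SparkSession and Spark 3.2 or later for the lowercase typedlit name; the aliases are hypothetical.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{ArrayType, MapType}

val spark = SparkSession.builder().master("local[1]").appName("typedlit-example").getOrCreate()

// The element, key and value types are taken from the Scala types via their TypeTags.
val literals = spark.range(1).select(
  typedlit(Seq(1, 2, 3)).as("xs"),
  typedlit(Map("a" -> 1)).as("m")
)

val xsType = literals.schema("xs").dataType
val mType = literals.schema("m").dataType
```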

def
udaf[IN, BUF, OUT](agg: expressions.Aggregator[IN, BUF, OUT], inputEncoder: Encoder[IN]): UserDefinedFunction
Obtains a UserDefinedFunction that wraps the given Aggregator so that it may be used with untyped Data Frames.
Aggregator<IN, BUF, OUT> agg = // custom Aggregator
Encoder<IN> enc = // input encoder
// declare a UDF based on agg
UserDefinedFunction aggUDF = udaf(agg, enc)
DataFrame aggData = df.agg(aggUDF($"colname"))
// register agg as a named function
spark.udf.register("myAggName", udaf(agg, enc))
 IN
the aggregator input type
 BUF
the aggregating buffer type
 OUT
the finalized output type
 agg
the typed Aggregator
 inputEncoder
a specific input encoder to use
 returns
a UserDefinedFunction that can be used as an aggregating expression
 Note
This overloading takes an explicit input encoder, to support UDAF declarations in Java.

def
udaf[IN, BUF, OUT](agg: expressions.Aggregator[IN, BUF, OUT])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[IN]): UserDefinedFunction
Obtains a UserDefinedFunction that wraps the given Aggregator so that it may be used with untyped Data Frames.
val agg = // Aggregator[IN, BUF, OUT]
// declare a UDF based on agg
val aggUDF = udaf(agg)
val aggData = df.agg(aggUDF($"colname"))
// register agg as a named function
spark.udf.register("myAggName", udaf(agg))
 IN
the aggregator input type
 BUF
the aggregating buffer type
 OUT
the finalized output type
 agg
the typed Aggregator
 returns
a UserDefinedFunction that can be used as an aggregating expression.
 Note
The input encoder is inferred from the input type IN.
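
To make the outline above concrete, here is a minimal sketch with a complete Aggregator. It assumes a local SparkSession and Spark 3.0 or later; LongSum, sumUdf and the column name v are hypothetical names chosen for this example.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// A typed aggregator that sums Long values; buffer and output are both Long.
object LongSum extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(left: Long, right: Long): Long = left + right
  def finish(reduction: Long): Long = reduction
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

val spark = SparkSession.builder().master("local[1]").appName("udaf-example").getOrCreate()
import spark.implicits._

// The input encoder is inferred from the input type Long.
val sumUdf = udaf(LongSum)

val total = Seq(1L, 2L, 3L).toDF("v").agg(sumUdf($"v")).head().getLong(0)
```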

def
udf(f: UDF10[_, _, _, _, _, _, _, _, _, _, _], returnType: DataType): UserDefinedFunction
Defines a Java UDF10 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
2.3.0

def
udf(f: UDF9[_, _, _, _, _, _, _, _, _, _], returnType: DataType): UserDefinedFunction
Defines a Java UDF9 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
2.3.0

def
udf(f: UDF8[_, _, _, _, _, _, _, _, _], returnType: DataType): UserDefinedFunction
Defines a Java UDF8 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
2.3.0

def
udf(f: UDF7[_, _, _, _, _, _, _, _], returnType: DataType): UserDefinedFunction
Defines a Java UDF7 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
2.3.0

def
udf(f: UDF6[_, _, _, _, _, _, _], returnType: DataType): UserDefinedFunction
Defines a Java UDF6 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
2.3.0

def
udf(f: UDF5[_, _, _, _, _, _], returnType: DataType): UserDefinedFunction
Defines a Java UDF5 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
2.3.0

def
udf(f: UDF4[_, _, _, _, _], returnType: DataType): UserDefinedFunction
Defines a Java UDF4 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
2.3.0

def
udf(f: UDF3[_, _, _, _], returnType: DataType): UserDefinedFunction
Defines a Java UDF3 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
2.3.0

def
udf(f: UDF2[_, _, _], returnType: DataType): UserDefinedFunction
Defines a Java UDF2 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
2.3.0

def
udf(f: UDF1[_, _], returnType: DataType): UserDefinedFunction
Defines a Java UDF1 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
2.3.0

def
udf(f: UDF0[_], returnType: DataType): UserDefinedFunction
Defines a Java UDF0 instance as a user-defined function (UDF). The caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
2.3.0
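As a rough illustration of the Java UDF variants above, the sketch below wraps a UDF1 from Scala and passes the return type explicitly. The names JavaUdfSketch, strLen and strLenNondeterministic are hypothetical and exist only for this example; it assumes Spark SQL is on the classpath.

```scala
import org.apache.spark.sql.api.java.UDF1
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

// Hypothetical holder object, used only for this illustration.
object JavaUdfSketch {
  // The return type is passed explicitly because it is not inferred for Java UDFs.
  val strLen: UserDefinedFunction = udf(
    new UDF1[String, Integer] {
      override def call(s: String): Integer =
        if (s == null) null else Integer.valueOf(s.length)
    },
    IntegerType
  )

  // A copy of the same UDF marked as nondeterministic; strLen itself is unchanged.
  val strLenNondeterministic: UserDefinedFunction = strLen.asNondeterministic()
}
```

Such a UDF would then be applied to a column like any other function, for example df.select(JavaUdfSketch.strLen(col("name"))), where df and the column name are placeholders.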

def
udf[RT, A1, A2, A3, A4, A5, A6, A7, A8, A9, A10](f: (A1, A2, A3, A4, A5, A6, A7, A8, A9, A10) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6], arg7: scala.reflect.api.JavaUniverse.TypeTag[A7], arg8: scala.reflect.api.JavaUniverse.TypeTag[A8], arg9: scala.reflect.api.JavaUniverse.TypeTag[A9], arg10: scala.reflect.api.JavaUniverse.TypeTag[A10]): UserDefinedFunction
Defines a Scala closure of 10 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
1.3.0

def
udf[RT, A1, A2, A3, A4, A5, A6, A7, A8, A9](f: (A1, A2, A3, A4, A5, A6, A7, A8, A9) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6], arg7: scala.reflect.api.JavaUniverse.TypeTag[A7], arg8: scala.reflect.api.JavaUniverse.TypeTag[A8], arg9: scala.reflect.api.JavaUniverse.TypeTag[A9]): UserDefinedFunction
Defines a Scala closure of 9 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
1.3.0

def
udf[RT, A1, A2, A3, A4, A5, A6, A7, A8](f: (A1, A2, A3, A4, A5, A6, A7, A8) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6], arg7: scala.reflect.api.JavaUniverse.TypeTag[A7], arg8: scala.reflect.api.JavaUniverse.TypeTag[A8]): UserDefinedFunction
Defines a Scala closure of 8 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
1.3.0

def
udf[RT, A1, A2, A3, A4, A5, A6, A7](f: (A1, A2, A3, A4, A5, A6, A7) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6], arg7: scala.reflect.api.JavaUniverse.TypeTag[A7]): UserDefinedFunction
Defines a Scala closure of 7 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
1.3.0

def
udf[RT, A1, A2, A3, A4, A5, A6](f: (A1, A2, A3, A4, A5, A6) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5], arg6: scala.reflect.api.JavaUniverse.TypeTag[A6]): UserDefinedFunction
Defines a Scala closure of 6 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
1.3.0

def
udf[RT, A1, A2, A3, A4, A5](f: (A1, A2, A3, A4, A5) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4], arg5: scala.reflect.api.JavaUniverse.TypeTag[A5]): UserDefinedFunction
Defines a Scala closure of 5 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
1.3.0

def
udf[RT, A1, A2, A3, A4](f: (A1, A2, A3, A4) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3], arg4: scala.reflect.api.JavaUniverse.TypeTag[A4]): UserDefinedFunction
Defines a Scala closure of 4 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
1.3.0

def
udf[RT, A1, A2, A3](f: (A1, A2, A3) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2], arg3: scala.reflect.api.JavaUniverse.TypeTag[A3]): UserDefinedFunction
Defines a Scala closure of 3 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
1.3.0

def
udf[RT, A1, A2](f: (A1, A2) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1], arg2: scala.reflect.api.JavaUniverse.TypeTag[A2]): UserDefinedFunction
Defines a Scala closure of 2 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
1.3.0

def
udf[RT, A1](f: (A1) ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT], arg1: scala.reflect.api.JavaUniverse.TypeTag[A1]): UserDefinedFunction
Defines a Scala closure of 1 argument as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
1.3.0

def
udf[RT](f: () ⇒ RT)(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[RT]): UserDefinedFunction
Defines a Scala closure of 0 arguments as a user-defined function (UDF). The data types are automatically inferred based on the Scala closure's signature. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
 Since
1.3.0
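The closure-based variants above can be sketched as follows. This is a minimal example, assuming Spark SQL is on the classpath; the names ScalaUdfSketch, plusOne and randomInt are hypothetical.

```scala
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.udf

// Hypothetical holder object, used only for this illustration.
object ScalaUdfSketch {
  // Input and output types (Int => Int) are inferred from the closure's signature.
  val plusOne: UserDefinedFunction = udf((x: Int) => x + 1)

  // A zero-argument closure whose result can differ between calls,
  // so it is explicitly marked as nondeterministic.
  val randomInt: UserDefinedFunction =
    udf(() => scala.util.Random.nextInt(100)).asNondeterministic()
}
```

Marking a UDF as nondeterministic is a hint to the optimizer; the closure itself is not changed by the call.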

def
unbase64(e: Column): Column
Decodes a BASE64 encoded string column and returns it as a binary column. This is the reverse of base64.
 Since
1.5.0

def
unhex(column: Column): Column
Inverse of hex. Interprets each pair of characters as a hexadecimal number and converts it to the byte representation of that number.
 Since
1.5.0
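The pair-by-pair interpretation described for unhex can be illustrated in plain Scala. This is only a sketch of the idea, not Spark's implementation; UnhexSketch is a hypothetical name, and the input is assumed to be an even-length string of valid hex digits.

```scala
// Hypothetical helper, used only for this illustration.
object UnhexSketch {
  // Reads the input two characters at a time and parses each pair as a base-16 byte.
  // Assumption: even-length input containing only valid hex digits.
  def unhex(hex: String): Array[Byte] =
    hex.grouped(2).map(pair => Integer.parseInt(pair, 16).toByte).toArray
}
```

For example, the pairs 53, 70, 61, 72 and 6B decode to the bytes of the ASCII string "Spark".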

def
unix_timestamp(s: Column, p: String): Column
Converts time string with given pattern to Unix timestamp (in seconds).
See Datetime Patterns for valid date and time format patterns.
 s
A date, timestamp or string. If a string, the data must be in a format that can be cast to a date, such as yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.SSSS
 p
A date time pattern detailing the format of s when s is a string
 returns
A long, or null if s was a string that could not be cast to a date or p was an invalid format
 Since
1.5.0
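The conversion can be sketched with java.time alone. This is an illustration under stated assumptions, not how Spark evaluates the function: the helper name UnixTimestampSketch.toUnixSeconds is hypothetical, the string is interpreted as UTC for simplicity, and None stands in for the SQL null that Spark returns.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, used only for this illustration.
object UnixTimestampSketch {
  // Parses `s` with `pattern` and returns seconds since 1970-01-01 00:00:00 UTC,
  // or None when the string or the pattern is invalid.
  // Assumption: the time string is interpreted as UTC.
  def toUnixSeconds(s: String, pattern: String): Option[Long] =
    Try {
      val formatter = DateTimeFormatter.ofPattern(pattern)
      LocalDateTime.parse(s, formatter).toEpochSecond(ZoneOffset.UTC)
    }.toOption
}
```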

def
unix_timestamp(s: Column): Column
Converts time string in format yyyy-MM-dd HH:mm:ss to Unix timestamp (in seconds), using the default timezone and the default locale.
 s
A date, timestamp or string. If a string, the data must be in the yyyy-MM-dd HH:mm:ss format
 returns
A long, or null if the input was a string not of the correct format
 Since
1.5.0

def
unix_timestamp(): Column
Returns the current Unix timestamp (in seconds) as a long.
 Since
1.5.0
 Note
All calls of unix_timestamp within the same query return the same value (i.e. the current timestamp is calculated at the start of query evaluation).

def
unwrap_udt(column: Column): Column
Unwrap UDT data type column into its underlying type.
 Since
3.4.0

def
upper(e: Column): Column
Converts a string column to upper case.
 Since
1.3.0

def
var_pop(columnName: String): Column
Aggregate function: returns the population variance of the values in a group.
 Since
1.6.0

def
var_pop(e: Column): Column
Aggregate function: returns the population variance of the values in a group.
 Since
1.6.0

def
var_samp(columnName: String): Column
Aggregate function: returns the unbiased variance of the values in a group.
 Since
1.6.0

def
var_samp(e: Column): Column
Aggregate function: returns the unbiased variance of the values in a group.
 Since
1.6.0
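The difference between the population variance (var_pop) and the unbiased variance (var_samp) is only the divisor. A small plain-Scala sketch, with the hypothetical helper VarianceSketch and no handling of empty or single-element input:

```scala
// Hypothetical helper, used only for this illustration.
object VarianceSketch {
  // Sum of squared deviations from the mean.
  private def sumSquaredDeviations(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    xs.map(x => (x - mean) * (x - mean)).sum
  }

  // Population variance: divide by n.
  def varPop(xs: Seq[Double]): Double = sumSquaredDeviations(xs) / xs.size

  // Unbiased sample variance: divide by n - 1.
  def varSamp(xs: Seq[Double]): Double = sumSquaredDeviations(xs) / (xs.size - 1)
}
```

For the values 2, 4, 4, 4, 5, 5, 7, 9 the mean is 5 and the squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the unbiased variance is 32 / 7.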

def
variance(columnName: String): Column
Aggregate function: alias for var_samp.
 Since
1.6.0

def
variance(e: Column): Column
Aggregate function: alias for var_samp.
 Since
1.6.0

final
def
wait(): Unit
 Definition Classes
 AnyRef
 Annotations
 @throws( ... )

final
def
wait(arg0: Long, arg1: Int): Unit
 Definition Classes
 AnyRef
 Annotations
 @throws( ... )

final
def
wait(arg0: Long): Unit
 Definition Classes
 AnyRef
 Annotations
 @throws( ... ) @native()

def
weekofyear(e: Column): Column
Extracts the week number as an integer from a given date/timestamp/string.
A week is considered to start on a Monday and week 1 is the first week with more than 3 days, as defined by ISO 8601.
 returns
An integer, or null if the input was a string that could not be cast to a date
 Since
1.5.0
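The ISO 8601 rule can be seen at a year boundary using java.time, which implements the same week definition. This sketch illustrates the rule only and does not call Spark; IsoWeekSketch is a hypothetical name.

```scala
import java.time.LocalDate
import java.time.temporal.IsoFields

// Hypothetical helper, used only for this illustration.
object IsoWeekSketch {
  // ISO 8601 week number: weeks start on Monday, and week 1 is the first week
  // with more than 3 days in the new year.
  def isoWeek(date: LocalDate): Int = date.get(IsoFields.WEEK_OF_WEEK_BASED_YEAR)
}
```

2021-01-01 is a Friday, so its week has only 3 days in 2021 and is numbered as week 53 of the previous year; 2021-01-04, the following Monday, starts week 1.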

def
when(condition: Column, value: Any): Column
Evaluates a list of conditions and returns one of multiple possible result expressions. If otherwise is not defined at the end, null is returned for unmatched conditions.
// Example: encoding gender string column into integer.

// Scala:
people.select(when(people("gender") === "male", 0)
  .when(people("gender") === "female", 1)
  .otherwise(2))

// Java:
people.select(when(col("gender").equalTo("male"), 0)
    .when(col("gender").equalTo("female"), 1)
    .otherwise(2))
 Since
1.4.0

def
window(timeColumn: Column, windowDuration: String): Column
Generates tumbling time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC. The following example takes the average stock price for a one minute tumbling window:
val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType
df.groupBy(window($"timestamp", "1 minute"), $"stockId")
  .agg(mean("price"))
The windows will look like:
09:00:00-09:01:00
09:01:00-09:02:00
09:02:00-09:03:00 ...
For a streaming query, you may use the function current_timestamp to generate windows on processing time.
 timeColumn
The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType.
 windowDuration
A string specifying the width of the window, e.g. 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers.
 Since
2.0.0
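The inclusive-start, exclusive-end rule for tumbling windows can be sketched as plain arithmetic on seconds since 1970-01-01 00:00:00 UTC. This is an illustration of the bucketing described above, not Spark's implementation; TumblingWindowSketch and windowFor are hypothetical names.

```scala
// Hypothetical helper, used only for this illustration.
object TumblingWindowSketch {
  // Returns the [start, end) window, in seconds since 1970-01-01 00:00:00 UTC,
  // that contains `ts`. Windows are aligned to the epoch.
  def windowFor(ts: Long, windowSeconds: Long): (Long, Long) = {
    val start = ts - Math.floorMod(ts, windowSeconds)
    (start, start + windowSeconds)
  }
}
```

With a 5 minute window, 12:05:00 (43500 seconds into the day) falls in [12:05, 12:10), while 12:04:59 still falls in [12:00, 12:05).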

def
window(timeColumn: Column, windowDuration: String, slideDuration: String): Column
Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The windows start beginning at 1970-01-01 00:00:00 UTC. The following example takes the average stock price for a one minute window every 10 seconds:
val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType
df.groupBy(window($"timestamp", "1 minute", "10 seconds"), $"stockId")
  .agg(mean("price"))
The windows will look like:
09:00:00-09:01:00
09:00:10-09:01:10
09:00:20-09:01:20 ...
For a streaming query, you may use the function current_timestamp to generate windows on processing time.
 timeColumn
The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType.
 windowDuration
A string specifying the width of the window, e.g. 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. Note that the duration is a fixed length of time, and does not vary over time according to a calendar. For example, 1 day always means 86,400,000 milliseconds, not a calendar day.
 slideDuration
A string specifying the sliding interval of the window, e.g. 1 minute. A new window will be generated every slideDuration. Must be less than or equal to the windowDuration. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. This duration is likewise absolute, and does not vary according to a calendar.
 Since
2.0.0

def
window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Column
Bucketize rows into one or more time windows given a timestamp specifying column. Window starts are inclusive but the window ends are exclusive, e.g. 12:05 will be in the window [12:05,12:10) but not in [12:00,12:05). Windows can support microsecond precision. Windows in the order of months are not supported. The following example takes the average stock price for a one minute window every 10 seconds starting 5 seconds after the hour:
val df = ... // schema => timestamp: TimestampType, stockId: StringType, price: DoubleType
df.groupBy(window($"timestamp", "1 minute", "10 seconds", "5 seconds"), $"stockId")
  .agg(mean("price"))
The windows will look like:
09:00:05-09:01:05
09:00:15-09:01:15
09:00:25-09:01:25 ...
For a streaming query, you may use the function current_timestamp to generate windows on processing time.
 timeColumn
The column or the expression to use as the timestamp for windowing by time. The time column must be of TimestampType or TimestampNTZType.
 windowDuration
A string specifying the width of the window, e.g. 10 minutes, 1 second. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. Note that the duration is a fixed length of time, and does not vary over time according to a calendar. For example, 1 day always means 86,400,000 milliseconds, not a calendar day.
 slideDuration
A string specifying the sliding interval of the window, e.g. 1 minute. A new window will be generated every slideDuration. Must be less than or equal to the windowDuration. Check org.apache.spark.unsafe.types.CalendarInterval for valid duration identifiers. This duration is likewise absolute, and does not vary according to a calendar.
 startTime
The offset with respect to 1970-01-01 00:00:00 UTC with which to start window intervals. For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15... provide startTime as 15 minutes.
 Since
2.0.0
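How slideDuration and startTime interact can be sketched as arithmetic on seconds since the epoch: a row belongs to every window whose start is at or before its timestamp and whose end is after it. The helper below is a hypothetical illustration, not Spark's implementation, and takes all durations in seconds rather than as interval strings.

```scala
// Hypothetical helper, used only for this illustration.
object SlidingWindowSketch {
  // Returns every [start, end) window, in seconds since the epoch, that contains `ts`.
  // Window starts are spaced `slide` seconds apart and shifted by `startTime` seconds.
  def windowsFor(ts: Long, window: Long, slide: Long, startTime: Long): List[(Long, Long)] = {
    // Latest window start that is not after ts.
    val lastStart = ts - Math.floorMod(ts - startTime, slide)
    // Walk backwards while the window still covers ts (the end is exclusive).
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start + window > ts)
      .map(start => (start, start + window))
      .toList
      .reverse
  }
}
```

With a 60 second window, a 10 second slide and a 5 second offset, a row at 09:00:30 lands in six windows, from 08:59:35-09:00:35 through 09:00:25-09:01:25. Setting the slide equal to the window gives tumbling windows, so an hourly window with a 15 minute offset puts 12:30 in 12:15-13:15.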

def
xxhash64(cols: Column*): Column
Calculates the hash code of given columns using the 64-bit variant of the xxHash algorithm, and returns the result as a long column.
 Annotations
 @varargs()
 Since
3.0.0

def
year(e: Column): Column
Extracts the year as an integer from a given date/timestamp/string.
 returns
An integer, or null if the input was a string that could not be cast to a date
 Since
1.5.0

def
years(e: Column): Column
A transform for timestamps and dates to partition data into years.
 Since
3.0.0

def
zip_with(left: Column, right: Column, f: (Column, Column) ⇒ Column): Column
Merge two given arrays, element-wise, into a single array using a function. If one array is shorter, nulls are appended at the end to match the length of the longer array, before applying the function.
df.select(zip_with(df("val1"), df("val2"), (x, y) => x + y))
 left
the left input array column
 right
the right input array column
 f
(lCol, rCol) => col, the lambda function to merge two input columns into one column
 Since
3.0.0
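The padding behaviour can be mimicked on ordinary Scala sequences with zipAll. In this sketch, which does not use Spark, None stands in for SQL null and the helper name ZipWithSketch.zipWith is hypothetical.

```scala
// Hypothetical helper, used only for this illustration.
object ZipWithSketch {
  // Pads the shorter sequence with None (standing in for SQL null) and then
  // applies `f` to each pair of elements.
  def zipWith[A, B, C](left: Seq[A], right: Seq[B])(f: (Option[A], Option[B]) => C): Seq[C] = {
    val l: Seq[Option[A]] = left.map(a => Option(a))
    val r: Seq[Option[B]] = right.map(b => Option(b))
    l.zipAll(r, None, None).map { case (a, b) => f(a, b) }
  }
}
```

Adding Seq(1, 2, 3) and Seq(10, 20) element-wise this way yields two sums followed by an empty result for the padded position, much as adding a number to null yields null in SQL.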
Deprecated Value Members

def
approxCountDistinct(columnName: String, rsd: Double): Column
 Annotations
 @deprecated
 Deprecated
(Since version 2.1.0) Use approx_count_distinct
 Since
1.3.0

def
approxCountDistinct(e: Column, rsd: Double): Column
 Annotations
 @deprecated
 Deprecated
(Since version 2.1.0) Use approx_count_distinct
 Since
1.3.0

def
approxCountDistinct(columnName: String): Column
 Annotations
 @deprecated
 Deprecated
(Since version 2.1.0) Use approx_count_distinct
 Since
1.3.0

def
approxCountDistinct(e: Column): Column
 Annotations
 @deprecated
 Deprecated
(Since version 2.1.0) Use approx_count_distinct
 Since
1.3.0

def
bitwiseNOT(e: Column): Column
Computes bitwise NOT (~) of a number.
 Annotations
 @deprecated
 Deprecated
(Since version 3.2.0) Use bitwise_not
 Since
1.4.0

def
callUDF(udfName: String, cols: Column*): Column
Call a user-defined function.
 Annotations
 @varargs() @deprecated
 Deprecated
Use call_udf
 Since
1.5.0

def
monotonicallyIncreasingId(): Column
A column expression that generates monotonically increasing 64-bit integers.
The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive. The current implementation puts the partition ID in the upper 31 bits, and the record number within each partition in the lower 33 bits. The assumption is that the data frame has fewer than 1 billion partitions, and each partition has fewer than 8 billion records.
As an example, consider a DataFrame with two partitions, each with 3 records. This expression would return the following IDs: 0, 1, 2, 8589934592 (1L << 33), 8589934593, 8589934594.
 Annotations
 @deprecated
 Deprecated
(Since version 2.0.0) Use monotonically_increasing_id()
 Since
1.4.0
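The bit layout described above can be reproduced with a shift and an addition. This sketch only restates the documented layout; MonotonicIdSketch.id is a hypothetical helper, not part of Spark.

```scala
// Hypothetical helper, used only for this illustration.
object MonotonicIdSketch {
  // Partition ID in the upper 31 bits, record number within the partition
  // in the lower 33 bits.
  def id(partitionId: Int, recordNumber: Long): Long =
    (partitionId.toLong << 33) + recordNumber
}
```

The first record of partition 1 therefore gets 1L << 33 = 8589934592, which matches the example IDs listed above.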

def
shiftLeft(e: Column, numBits: Int): Column
Shift the given value numBits left. If the given value is a long value, this function will return a long value; otherwise it will return an integer value.
 Annotations
 @deprecated
 Deprecated
(Since version 3.2.0) Use shiftleft
 Since
1.5.0

def
shiftRight(e: Column, numBits: Int): Column
(Signed) shift the given value numBits right. If the given value is a long value, it will return a long value; otherwise it will return an integer value.
 Annotations
 @deprecated
 Deprecated
(Since version 3.2.0) Use shiftright
 Since
1.5.0

def
shiftRightUnsigned(e: Column, numBits: Int): Column
Unsigned shift the given value numBits right. If the given value is a long value, it will return a long value; otherwise it will return an integer value.
 Annotations
 @deprecated
 Deprecated
(Since version 3.2.0) Use shiftrightunsigned
 Since
1.5.0
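The difference between the signed and unsigned right shifts is easiest to see on a negative number. The sketch below uses the JVM shift operators on plain values, on the assumption that the column functions follow the same semantics; ShiftSketch is a hypothetical name.

```scala
// Hypothetical helper, used only for this illustration.
object ShiftSketch {
  // Signed shift copies the sign bit into the vacated positions.
  def shiftRightSigned(value: Int, numBits: Int): Int = value >> numBits

  // Unsigned shift fills the vacated positions with zeros.
  def shiftRightUnsigned(value: Int, numBits: Int): Int = value >>> numBits

  // A long input produces a long result.
  def shiftLeft(value: Long, numBits: Int): Long = value << numBits
}
```

Shifting the integer -8 right by one bit gives -4 when signed and 2147483644 when unsigned.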

def
sumDistinct(columnName: String): Column
Aggregate function: returns the sum of distinct values in the expression.
 Annotations
 @deprecated
 Deprecated
(Since version 3.2.0) Use sum_distinct
 Since
1.3.0

def
sumDistinct(e: Column): Column
Aggregate function: returns the sum of distinct values in the expression.
 Annotations
 @deprecated
 Deprecated
(Since version 3.2.0) Use sum_distinct
 Since
1.3.0

def
toDegrees(columnName: String): Column
 Annotations
 @deprecated
 Deprecated
(Since version 2.1.0) Use degrees
 Since
1.4.0

def
toDegrees(e: Column): Column
 Annotations
 @deprecated
 Deprecated
(Since version 2.1.0) Use degrees
 Since
1.4.0

def
toRadians(columnName: String): Column
 Annotations
 @deprecated
 Deprecated
(Since version 2.1.0) Use radians
 Since
1.4.0

def
toRadians(e: Column): Column
 Annotations
 @deprecated
 Deprecated
(Since version 2.1.0) Use radians
 Since
1.4.0

def
udf(f: AnyRef, dataType: DataType): UserDefinedFunction
Defines a deterministic user-defined function (UDF) using a Scala closure. For this variant, the caller must specify the output data type, and there is no automatic input type coercion. By default the returned UDF is deterministic. To change it to nondeterministic, call the API UserDefinedFunction.asNondeterministic().
Note that, although the Scala closure can have primitive-type function arguments, it doesn't work well with null values. Because the Scala closure is passed in as Any type, there is no type information for the function arguments. Without the type information, Spark may blindly pass null to the Scala closure with a primitive-type argument, and the closure will see the default value of the Java type for the null argument, e.g. for udf((x: Int) => x, IntegerType), the result is 0 for null input.
 f
A closure in Scala
A closure in Scala
 dataType
The output data type of the UDF
 Annotations
 @deprecated
 Deprecated
(Since version 3.0.0)
 Since
2.0.0
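The null pitfall noted for this variant can be reproduced without Spark. In the plain-Scala sketch below, a closure with an Int parameter is held as an untyped reference and called with null, roughly as a caller without type information would call it; NullPrimitiveSketch is a hypothetical name.

```scala
// Hypothetical helper, used only for this illustration.
object NullPrimitiveSketch {
  // The closure is held as an untyped reference, so its Int parameter type is not visible.
  val untyped: AnyRef = (x: Int) => x

  // Calling it with null unboxes null to the Int default value, 0,
  // instead of failing or propagating the null.
  def callWithNull(): Any = untyped.asInstanceOf[Any => Any].apply(null)
}
```

The typed udf variants avoid this because the argument types are known from the closure's signature.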