final class DataFrameStatFunctions extends AnyRef
Statistic functions for DataFrames.
 Annotations
 @Stable()
 Since
1.4.0
Value Members

final
def
!=(arg0: Any): Boolean
 Definition Classes
 AnyRef → Any

final
def
##(): Int
 Definition Classes
 AnyRef → Any

final
def
==(arg0: Any): Boolean
 Definition Classes
 AnyRef → Any

def
approxQuantile(cols: Array[String], probabilities: Array[Double], relativeError: Double): Array[Array[Double]]
Calculates the approximate quantiles of numerical columns of a DataFrame.
 cols
the names of the numerical columns
 probabilities
a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
 relativeError
The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
 returns
the approximate quantiles at the given probabilities of each column
 Since
2.2.0
 Note
null and NaN values will be ignored in numerical columns before calculation. For columns only containing null or NaN values, an empty array is returned.
 See also
approxQuantile(col: String, probabilities: Array[Double], relativeError: Double) for a detailed description.

def
approxQuantile(col: String, probabilities: Array[Double], relativeError: Double): Array[Double]
Calculates the approximate quantiles of a numerical column of a DataFrame.
The result of this algorithm has the following deterministic bound: If the DataFrame has N elements and if we request the quantile at probability p up to error err, then the algorithm will return a sample x from the DataFrame so that the *exact* rank of x is close to (p * N). More precisely,
floor((p - err) * N) <= rank(x) <= ceil((p + err) * N)
This method implements a variation of the Greenwald-Khanna algorithm (with some speed optimizations). The algorithm was first presented in Space-efficient Online Computation of Quantile Summaries by Greenwald and Khanna.
 col
the name of the numerical column
 probabilities
a list of quantile probabilities. Each number must belong to [0, 1]. For example, 0 is the minimum, 0.5 is the median, and 1 is the maximum.
 relativeError
The relative target precision to achieve (greater than or equal to 0). If set to zero, the exact quantiles are computed, which could be very expensive. Note that values greater than 1 are accepted but give the same result as 1.
 returns
the approximate quantiles at the given probabilities
 Since
2.0.0
 Note
null and NaN values will be removed from the numerical column before calculation. If the DataFrame is empty or the column only contains null or NaN, an empty array is returned.
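To make the deterministic bound concrete, the following minimal sketch computes the interval of ranks that a returned sample is allowed to have. It is plain Scala and does not call Spark. The object name `RankBoundSketch`, the helper `rankBounds`, and the clamping of the interval to [0, N] are assumptions made for illustration, not part of the API.

```scala
object RankBoundSketch {

  /** Inclusive rank interval floor((p - err) * N) .. ceil((p + err) * N).
    * Clamping to [0, n] is an assumption of this sketch, not documented behaviour.
    */
  def rankBounds(p: Double, err: Double, n: Long): (Long, Long) = {
    require(p >= 0.0 && p <= 1.0, "p must belong to [0, 1]")
    require(err >= 0.0, "err must be greater than or equal to 0")
    val lower = math.floor((p - err) * n).toLong
    val upper = math.ceil((p + err) * n).toLong
    (math.max(lower, 0L), math.min(upper, n))
  }

  def main(args: Array[String]): Unit = {
    // Median of 1024 elements with err = 1/64: any sample ranked 496..528 satisfies the bound.
    println(rankBounds(0.5, 0.015625, 1024L)) // (496,528)
  }
}
```

With `relativeError` set to 0 the interval collapses to the exact rank, which matches the parameter description above: exact quantiles are computed, at a potentially high cost.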

final
def
asInstanceOf[T0]: T0
 Definition Classes
 Any

def
bloomFilter(col: Column, expectedNumItems: Long, numBits: Long): BloomFilter
Builds a Bloom filter over a specified column.
 col
the column over which the filter is built
 expectedNumItems
expected number of items which will be put into the filter.
 numBits
expected number of bits of the filter.
 Since
2.0.0

def
bloomFilter(colName: String, expectedNumItems: Long, numBits: Long): BloomFilter
Builds a Bloom filter over a specified column.
 colName
name of the column over which the filter is built
 expectedNumItems
expected number of items which will be put into the filter.
 numBits
expected number of bits of the filter.
 Since
2.0.0

def
bloomFilter(col: Column, expectedNumItems: Long, fpp: Double): BloomFilter
Builds a Bloom filter over a specified column.
 col
the column over which the filter is built
 expectedNumItems
expected number of items which will be put into the filter.
 fpp
expected false positive probability of the filter.
 Since
2.0.0

def
bloomFilter(colName: String, expectedNumItems: Long, fpp: Double): BloomFilter
Builds a Bloom filter over a specified column.
 colName
name of the column over which the filter is built
 expectedNumItems
expected number of items which will be put into the filter.
 fpp
expected false positive probability of the filter.
 Since
2.0.0
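A hedged usage sketch for the `bloomFilter` overloads, assuming a local `SparkSession` and a `LongType` column named `id`. The object name `BloomFilterSketch`, the helper `build`, and the sizes chosen are illustrative assumptions. The methods called on the result (`mightContainLong`, `expectedFpp`, `bitSize`) come from `org.apache.spark.util.sketch.BloomFilter`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

object BloomFilterSketch {

  /** Builds a filter over ids 0..999, sized by a target false positive probability. */
  def build(spark: SparkSession): BloomFilter = {
    val df = spark.range(0, 1000).toDF("id")
    df.stat.bloomFilter("id", 1000L, 0.03)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("bloom-filter-sketch")
      .getOrCreate()

    val byFpp = build(spark)
    // Same data, but sized by an explicit number of bits instead of fpp.
    val byBits = spark.range(0, 1000).toDF("id").stat.bloomFilter("id", 1000L, 8192L)

    println(byFpp.mightContainLong(42L))   // true: inserted values are never reported absent
    println(byFpp.mightContainLong(5000L)) // usually false, but may be a false positive
    println(byFpp.expectedFpp())           // current estimate of the false positive probability
    println(byBits.bitSize())              // number of bits actually allocated

    spark.stop()
  }
}
```

The third argument selects the overload by type: a `Double` is read as `fpp`, a `Long` as `numBits`.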

def
clone(): AnyRef
 Attributes
 protected[lang]
 Definition Classes
 AnyRef
 Annotations
 @throws( ... ) @native()

def
corr(col1: String, col2: String): Double
Calculates the Pearson Correlation Coefficient of two columns of a DataFrame.
 col1
the name of the column
 col2
the name of the column to calculate the correlation against
 returns
The Pearson Correlation Coefficient as a Double.
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2")
res1: Double = 0.613...
 Since
1.4.0

def
corr(col1: String, col2: String, method: String): Double
Calculates the correlation of two columns of a DataFrame. Currently only supports the Pearson Correlation Coefficient. For Spearman Correlation, consider using RDD methods found in MLlib's Statistics.
 col1
the name of the column
 col2
the name of the column to calculate the correlation against
 returns
The Pearson Correlation Coefficient as a Double.
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.corr("rand1", "rand2", "pearson")
res1: Double = 0.613...
 Since
1.4.0

def
countMinSketch(col: Column, eps: Double, confidence: Double, seed: Int): CountMinSketch
Builds a Count-min Sketch over a specified column.
 col
the column over which the sketch is built
 eps
relative error of the sketch
 confidence
confidence of the sketch
 seed
random seed
 returns
a CountMinSketch over column col
 Since
2.0.0

def
countMinSketch(col: Column, depth: Int, width: Int, seed: Int): CountMinSketch
Builds a Count-min Sketch over a specified column.
 col
the column over which the sketch is built
 depth
depth of the sketch
 width
width of the sketch
 seed
random seed
 returns
a CountMinSketch over column col
 Since
2.0.0

def
countMinSketch(colName: String, eps: Double, confidence: Double, seed: Int): CountMinSketch
Builds a Count-min Sketch over a specified column.
 colName
name of the column over which the sketch is built
 eps
relative error of the sketch
 confidence
confidence of the sketch
 seed
random seed
 returns
a CountMinSketch over column colName
 Since
2.0.0

def
countMinSketch(colName: String, depth: Int, width: Int, seed: Int): CountMinSketch
Builds a Count-min Sketch over a specified column.
 colName
name of the column over which the sketch is built
 depth
depth of the sketch
 width
width of the sketch
 seed
random seed
 returns
a CountMinSketch over column colName
 Since
2.0.0
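A hedged usage sketch for the `countMinSketch` overloads, assuming a local `SparkSession` and a string column named `word`. The object name `CountMinSketchExample`, the helper `build`, the data, and the parameter values are illustrative assumptions. `estimateCount`, `totalCount`, `depth`, and `width` are methods of `org.apache.spark.util.sketch.CountMinSketch`.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.CountMinSketch

object CountMinSketchExample {

  /** Builds a sketch from a target relative error (eps) and confidence. */
  def build(spark: SparkSession): CountMinSketch = {
    import spark.implicits._
    // "a" occurs 50 times, "b" 5 times, "c" once.
    val df = (Seq.fill(50)("a") ++ Seq.fill(5)("b") ++ Seq("c")).toDF("word")
    df.stat.countMinSketch("word", 0.01, 0.99, 42)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("count-min-sketch-example")
      .getOrCreate()
    import spark.implicits._

    val byError = build(spark)
    // A Count-min Sketch can overestimate a frequency but never underestimates it.
    println(byError.estimateCount("a")) // at least 50
    println(byError.totalCount())       // total number of items added to the sketch

    // Alternative: fix the table shape (depth x width) directly instead of eps and confidence.
    val byShape = Seq("a", "a", "b").toDF("word").stat.countMinSketch("word", 5, 256, 42)
    println(s"${byShape.depth()} x ${byShape.width()}")

    spark.stop()
  }
}
```

The argument types select the overload: two `Double` values are read as `eps` and `confidence`, two `Int` values as `depth` and `width`.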

def
cov(col1: String, col2: String): Double
Calculate the sample covariance of two numerical columns of a DataFrame.
 col1
the name of the first column
 col2
the name of the second column
 returns
the covariance of the two columns.
val df = sc.parallelize(0 until 10).toDF("id").withColumn("rand1", rand(seed=10))
  .withColumn("rand2", rand(seed=27))
df.stat.cov("rand1", "rand2")
res1: Double = 0.065...
 Since
1.4.0
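For reference, "sample covariance" means the sum of products of deviations from the means, divided by n - 1 rather than n. The sketch below is the textbook formula on in-memory sequences, not Spark's distributed implementation. The names `SampleCovarianceSketch` and `sampleCovariance` are hypothetical.

```scala
object SampleCovarianceSketch {

  /** Textbook sample covariance: sum((x - meanX) * (y - meanY)) / (n - 1). */
  def sampleCovariance(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.length == ys.length, "both series must have the same length")
    require(xs.length > 1, "at least two observations are needed")
    val n = xs.length
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    val sumOfProducts = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    sumOfProducts / (n - 1)
  }

  def main(args: Array[String]): Unit = {
    // Deviations are (-1, 0, 1) and (-2, 0, 2), so the result is (2 + 0 + 2) / 2.
    println(sampleCovariance(Seq(1.0, 2.0, 3.0), Seq(2.0, 4.0, 6.0))) // 2.0
  }
}
```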

def
crosstab(col1: String, col2: String): DataFrame
Computes a pairwise frequency table of the given columns. Also known as a contingency table. The number of distinct values for each column should be less than 1e4. At most 1e6 non-zero pair frequencies will be returned. The first column of each row will be the distinct values of col1 and the column names will be the distinct values of col2. The name of the first column will be col1_col2. Counts will be returned as Longs. Pairs that have no occurrences will have zero as their counts. Null elements will be replaced by "null", and back ticks will be dropped from elements if they exist.
 col1
The name of the first column. Distinct items will make the first item of each row.
 col2
The name of the second column. Distinct items will make the column names of the DataFrame.
 returns
A DataFrame containing the contingency table.
val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3)))
  .toDF("key", "value")
val ct = df.stat.crosstab("key", "value")
ct.show()

+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
 Since
1.4.0

final
def
eq(arg0: AnyRef): Boolean
 Definition Classes
 AnyRef

def
equals(arg0: Any): Boolean
 Definition Classes
 AnyRef → Any

def
finalize(): Unit
 Attributes
 protected[lang]
 Definition Classes
 AnyRef
 Annotations
 @throws( classOf[java.lang.Throwable] )

def
freqItems(cols: Seq[String]): DataFrame
(Scala-specific) Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou, with a default support of 1%.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
 cols
the names of the columns to search frequent items in.
 returns
A Local DataFrame with the Array of frequent items for each column.
 Since
1.4.0

def
freqItems(cols: Seq[String], support: Double): DataFrame
(Scala-specific) Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
 cols
the names of the columns to search frequent items in.
 returns
A Local DataFrame with the Array of frequent items for each column.
val rows = Seq.tabulate(100) { i =>
  if (i % 2 == 0) (1, 1.0) else (i, i * 1.0)
}
val df = spark.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Seq("a", "b"), 0.4)
freqSingles.show()
+-----------+-----------+
|a_freqItems|b_freqItems|
+-----------+-----------+
|    [1, 99]|[1.0, 99.0]|
+-----------+-----------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("ab"))
val freqPairs = pairDf.stat.freqItems(Seq("ab"), 0.1)
freqPairs.select(explode($"ab_freqItems").as("freq_ab")).show()
+--------+
| freq_ab|
+--------+
| [1,1.0]|
|   ...  |
+--------+
 Since
1.4.0

def
freqItems(cols: Array[String]): DataFrame
Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou, with a default support of 1%.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
 cols
the names of the columns to search frequent items in.
 returns
A Local DataFrame with the Array of frequent items for each column.
 Since
1.4.0

def
freqItems(cols: Array[String], support: Double): DataFrame
Finding frequent items for columns, possibly with false positives. Uses the frequent element count algorithm proposed by Karp, Schenker, and Papadimitriou. The support should be greater than 1e-4.
This function is meant for exploratory data analysis, as we make no guarantee about the backward compatibility of the schema of the resulting DataFrame.
 cols
the names of the columns to search frequent items in.
 support
The minimum frequency for an item to be considered frequent. Should be greater than 1e-4.
 returns
A Local DataFrame with the Array of frequent items for each column.
val rows = Seq.tabulate(100) { i =>
  if (i % 2 == 0) (1, 1.0) else (i, i * 1.0)
}
val df = spark.createDataFrame(rows).toDF("a", "b")
// find the items with a frequency greater than 0.4 (observed 40% of the time) for columns
// "a" and "b"
val freqSingles = df.stat.freqItems(Array("a", "b"), 0.4)
freqSingles.show()
+-----------+-----------+
|a_freqItems|b_freqItems|
+-----------+-----------+
|    [1, 99]|[1.0, 99.0]|
+-----------+-----------+
// find the pair of items with a frequency greater than 0.1 in columns "a" and "b"
val pairDf = df.select(struct("a", "b").as("ab"))
val freqPairs = pairDf.stat.freqItems(Array("ab"), 0.1)
freqPairs.select(explode($"ab_freqItems").as("freq_ab")).show()
+--------+
| freq_ab|
+--------+
| [1,1.0]|
|   ...  |
+--------+
 Since
1.4.0

final
def
getClass(): Class[_]
 Definition Classes
 AnyRef → Any
 Annotations
 @native()

def
hashCode(): Int
 Definition Classes
 AnyRef → Any
 Annotations
 @native()

final
def
isInstanceOf[T0]: Boolean
 Definition Classes
 Any

final
def
ne(arg0: AnyRef): Boolean
 Definition Classes
 AnyRef

final
def
notify(): Unit
 Definition Classes
 AnyRef
 Annotations
 @native()

final
def
notifyAll(): Unit
 Definition Classes
 AnyRef
 Annotations
 @native()

def
sampleBy[T](col: Column, fractions: Map[T, Double], seed: Long): DataFrame
(Java-specific) Returns a stratified sample without replacement based on the fraction given on each stratum.
 T
stratum type
 col
column that defines strata
 fractions
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
 seed
random seed
 returns
a new DataFrame that represents the stratified sample
 Since
3.0.0

def
sampleBy[T](col: Column, fractions: Map[T, Double], seed: Long): DataFrame
Returns a stratified sample without replacement based on the fraction given on each stratum.
 T
stratum type
 col
column that defines strata
 fractions
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
 seed
random seed
 returns
a new DataFrame that represents the stratified sample
The stratified sample can be performed over multiple columns:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.struct

val df = spark.createDataFrame(Seq(("Bob", 17), ("Alice", 10), ("Nico", 8), ("Bob", 17),
  ("Alice", 10))).toDF("name", "age")
val fractions = Map(Row("Alice", 10) -> 0.3, Row("Nico", 8) -> 1.0)
df.stat.sampleBy(struct($"name", $"age"), fractions, 36L).show()
+-----+---+
| name|age|
+-----+---+
| Nico|  8|
|Alice| 10|
+-----+---+
 Since
3.0.0

def
sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame
Returns a stratified sample without replacement based on the fraction given on each stratum.
 T
stratum type
 col
column that defines strata
 fractions
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
 seed
random seed
 returns
a new DataFrame that represents the stratified sample
 Since
1.5.0

def
sampleBy[T](col: String, fractions: Map[T, Double], seed: Long): DataFrame
Returns a stratified sample without replacement based on the fraction given on each stratum.
 T
stratum type
 col
column that defines strata
 fractions
sampling fraction for each stratum. If a stratum is not specified, we treat its fraction as zero.
 seed
random seed
 returns
a new DataFrame that represents the stratified sample
val df = spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2),
  (3, 3))).toDF("key", "value")
val fractions = Map(1 -> 1.0, 3 -> 0.5)
df.stat.sampleBy("key", fractions, 36L).show()
+---+-----+
|key|value|
+---+-----+
|  1|    1|
|  1|    2|
|  3|    2|
+---+-----+
 Since
1.5.0
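The semantics of `fractions` can be illustrated without Spark. The sketch below assumes that stratified sampling without replacement amounts to one independent random draw per row, kept when the draw falls below the fraction of the row's stratum. That is an assumption of this sketch rather than a statement about Spark's internals, and `StratifiedSampleSketch` is a hypothetical name. It uses `scala.util.Random`, so it will not reproduce the rows Spark selects for the same seed.

```scala
import scala.util.Random

object StratifiedSampleSketch {

  /** Keeps each (stratum, value) row with the probability given for its stratum.
    * A stratum missing from `fractions` is treated as fraction zero.
    */
  def sampleBy[K, V](rows: Seq[(K, V)], fractions: Map[K, Double], seed: Long): Seq[(K, V)] = {
    val rng = new Random(seed)
    rows.filter { case (stratum, _) =>
      // nextDouble() lies in [0, 1): fraction 1.0 always keeps the row, 0.0 never does.
      rng.nextDouble() < fractions.getOrElse(stratum, 0.0)
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))
    val fractions = Map(1 -> 1.0, 3 -> 0.5)
    // Every row of stratum 1 is kept, none of stratum 2, and roughly half of stratum 3.
    sampleBy(rows, fractions, 36L).foreach(println)
  }
}
```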

final
def
synchronized[T0](arg0: ⇒ T0): T0
 Definition Classes
 AnyRef

def
toString(): String
 Definition Classes
 AnyRef → Any

final
def
wait(): Unit
 Definition Classes
 AnyRef
 Annotations
 @throws( ... )

final
def
wait(arg0: Long, arg1: Int): Unit
 Definition Classes
 AnyRef
 Annotations
 @throws( ... )

final
def
wait(arg0: Long): Unit
 Definition Classes
 AnyRef
 Annotations
 @throws( ... ) @native()