KolmogorovSmirnovTest

class pyspark.ml.stat.KolmogorovSmirnovTest

Conduct the two-sided Kolmogorov-Smirnov (KS) test for data sampled from a continuous distribution.

The test compares the empirical cumulative distribution of the sample data with the theoretical distribution and takes the largest difference between them as the test statistic. This statistic provides a test of the null hypothesis that the sample data comes from that theoretical distribution.
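The "largest difference" described above can be sketched in plain Python. This is an illustration only, not Spark's implementation: the helper names `normal_cdf` and `ks_statistic` are hypothetical, and the normal CDF is computed with `math.erf` from the standard library.

```python
import math

def normal_cdf(x, mean=0.0, std=1.0):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def ks_statistic(sample, mean=0.0, std=1.0):
    """Largest gap between the sample's empirical CDF and the normal CDF.

    Hypothetical helper for illustration; not part of pyspark.
    """
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = normal_cdf(x, mean, std)
        # The empirical CDF steps from i/n to (i+1)/n at x, so check both sides.
        d = max(d, f - i / n, (i + 1) / n - f)
    return d

print(round(ks_statistic([-1.0, 0.0, 1.0], 0.0, 1.0), 3))  # 0.175
```

Checking both sides of each step matters because the empirical distribution is a step function: the largest gap can occur just before or just after a sample point.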

Methods

test(dataset, sampleCol, distName, *params)

Conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality.

Methods Documentation

static test(dataset: pyspark.sql.dataframe.DataFrame, sampleCol: str, distName: str, *params: float) → pyspark.sql.dataframe.DataFrame

Conduct a one-sample, two-sided Kolmogorov-Smirnov test for probability distribution equality. Currently supports the normal distribution, taking as parameters the mean and standard deviation.

Parameters
dataset : pyspark.sql.DataFrame

A DataFrame containing the sample of data to test.

sampleCol : str

Name of the sample column in dataset, of any numerical type.

distName : str

A string name for a theoretical distribution. Currently only “norm” is supported.

*params : float

Float values specifying the parameters of the theoretical distribution. For the “norm” distribution, the parameters are the mean and standard deviation.

Returns
A DataFrame that contains the Kolmogorov-Smirnov test result for the input sampled data.
This DataFrame will contain a single Row with the following fields:
  • pValue: Double
  • statistic: Double
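For intuition about how a statistic and a sample size relate to a p-value, the sketch below evaluates the classical asymptotic Kolmogorov series. This is an assumption made for illustration: this page does not state how pValue is computed, the function name `asymptotic_ks_pvalue` is hypothetical, and the asymptotic form is only a rough approximation for small samples.

```python
import math

def asymptotic_ks_pvalue(statistic, n, terms=100):
    """Approximate two-sided p-value from the limiting Kolmogorov distribution.

    Hypothetical helper for illustration; not part of pyspark.
    """
    lam = math.sqrt(n) * statistic
    if lam < 0.2:
        # The alternating series converges slowly here, and the limit
        # differs from 1 by less than 1e-12, so return 1 directly.
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    # Clamp to guard against floating-point drift outside [0, 1].
    return min(1.0, max(0.0, 2.0 * total))

# A small statistic on a tiny sample gives no evidence against the null.
print(asymptotic_ks_pvalue(0.175, 3) > 0.99)  # True
```

A large p-value means the sample is consistent with the theoretical distribution; a small one is evidence against the null hypothesis.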

Examples

>>> from pyspark.ml.stat import KolmogorovSmirnovTest
>>> dataset = [[-1.0], [0.0], [1.0]]
>>> dataset = spark.createDataFrame(dataset, ['sample'])
>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 0.0, 1.0).first()
>>> round(ksResult.pValue, 3)
1.0
>>> round(ksResult.statistic, 3)
0.175
>>> dataset = [[2.0], [3.0], [4.0]]
>>> dataset = spark.createDataFrame(dataset, ['sample'])
>>> ksResult = KolmogorovSmirnovTest.test(dataset, 'sample', 'norm', 3.0, 1.0).first()
>>> round(ksResult.pValue, 3)
1.0
>>> round(ksResult.statistic, 3)
0.175