ChiSquareTest
class pyspark.ml.stat.ChiSquareTest
Conduct Pearson’s independence test for every feature against the label. For each feature, the (feature, label) pairs are converted into a contingency matrix for which the Chi-squared statistic is computed. All label and feature values must be categorical.
The null hypothesis is that the occurrence of the outcomes is statistically independent.
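The contingency-matrix computation described above can be sketched in plain Python. This is only an illustration of the textbook Pearson statistic, not Spark's implementation; the helper name `chi_square_statistic` is hypothetical, and the data is the same four-row dataset used in the Examples section.

```python
from collections import Counter


def chi_square_statistic(feature_values, labels):
    """Return (statistic, degrees_of_freedom) for one categorical feature.

    Hypothetical helper for illustration only; not part of pyspark.
    """
    n = len(labels)
    # Observed counts: one cell per (feature value, label) pair.
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)

    statistic = 0.0
    for f, f_count in feature_totals.items():
        for l, l_count in label_totals.items():
            # Expected count under the null hypothesis of independence.
            expected = f_count * l_count / n
            statistic += (observed[(f, l)] - expected) ** 2 / expected

    dof = (len(feature_totals) - 1) * (len(label_totals) - 1)
    return statistic, dof


labels = [0, 0, 1, 1]
rows = [[0, 0, 1], [1, 0, 1], [2, 1, 1], [3, 1, 1]]

# One test per feature column, each against the same label column.
results = [chi_square_statistic([r[j] for r in rows], labels) for j in range(3)]
print(results)  # [(4.0, 3), (4.0, 1), (0.0, 0)]
```

The degrees of freedom (3, 1, 0) and the statistic 4.0 for the first feature agree with the values shown in the Examples section. Computing the p-value additionally requires the chi-squared distribution's survival function, which this sketch leaves out.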
Methods
test(dataset, featuresCol, labelCol[, flatten])    Perform a Pearson’s independence test using dataset.
Methods Documentation
static test(dataset: pyspark.sql.dataframe.DataFrame, featuresCol: str, labelCol: str, flatten: bool = False) → pyspark.sql.dataframe.DataFrame
Perform a Pearson’s independence test using dataset.
Added optional flatten argument.
Parameters
dataset : pyspark.sql.DataFrame
    DataFrame of categorical labels and categorical features. Real-valued features will be treated as categorical for each distinct value.
featuresCol : str
    Name of the features column in dataset, of type Vector (VectorUDT).
labelCol : str
    Name of the label column in dataset, of any numerical type.
flatten : bool, optional
    If True, flattens the returned DataFrame.
Returns
pyspark.sql.DataFrame
DataFrame containing the test result for every feature against the label. If flatten is True, this DataFrame will contain one row per feature with the following fields:
featureIndex: int
pValue: float
degreesOfFreedom: int
statistic: float
If flatten is False, this DataFrame will contain a single Row with the following fields:
pValues: Vector
degreesOfFreedom: Array[int]
statistics: Vector
Each of these fields has one value per feature.
Examples
>>> from pyspark.ml.linalg import Vectors
>>> from pyspark.ml.stat import ChiSquareTest
>>> dataset = [[0, Vectors.dense([0, 0, 1])],
...            [0, Vectors.dense([1, 0, 1])],
...            [1, Vectors.dense([2, 1, 1])],
...            [1, Vectors.dense([3, 1, 1])]]
>>> dataset = spark.createDataFrame(dataset, ["label", "features"])
>>> chiSqResult = ChiSquareTest.test(dataset, 'features', 'label')
>>> chiSqResult.select("degreesOfFreedom").collect()[0]
Row(degreesOfFreedom=[3, 1, 0])
>>> chiSqResult = ChiSquareTest.test(dataset, 'features', 'label', True)
>>> row = chiSqResult.orderBy("featureIndex").collect()
>>> row[0].statistic
4.0
-
static