ChiSqSelector

class pyspark.mllib.feature.ChiSqSelector(numTopFeatures: int = 50, selectorType: str = 'numTopFeatures', percentile: float = 0.1, fpr: float = 0.05, fdr: float = 0.05, fwe: float = 0.05)

Creates a ChiSquared feature selector. The selector supports different selection methods: numTopFeatures, percentile, fpr, fdr, fwe.

  • numTopFeatures chooses a fixed number of top features according to a chi-squared test.

  • percentile is similar but chooses a fraction of all features instead of a fixed number.

  • fpr chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection.

  • fdr uses the Benjamini-Hochberg procedure to choose all features whose false discovery rate is below a threshold.

  • fwe chooses all features whose p-values are below a threshold. The threshold is scaled by 1/numFeatures, thus controlling the family-wise error rate of selection.

By default, the selection method is numTopFeatures, with the default number of top features set to 50.
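The five selection rules above can be illustrated without Spark. The following is a minimal plain-Python sketch, assuming the chi-squared p-value of each feature has already been computed; the function `select_features` and its parameter names are hypothetical and are not part of the PySpark API, and Spark's own implementation may differ in details such as tie-breaking.

```python
from typing import List


def select_features(
    p_values: List[float],
    selector_type: str = "numTopFeatures",
    num_top_features: int = 50,
    percentile: float = 0.1,
    fpr: float = 0.05,
    fdr: float = 0.05,
    fwe: float = 0.05,
) -> List[int]:
    """Return the sorted indices of the features kept by the chosen rule."""
    n = len(p_values)
    # Feature indices ordered from smallest to largest p-value.
    ranked = sorted(range(n), key=lambda i: p_values[i])

    if selector_type == "numTopFeatures":
        chosen = ranked[:num_top_features]
    elif selector_type == "percentile":
        # Keep a fraction of all features, rounded down.
        chosen = ranked[: int(n * percentile)]
    elif selector_type == "fpr":
        chosen = [i for i in range(n) if p_values[i] < fpr]
    elif selector_type == "fdr":
        # Benjamini-Hochberg: find the largest rank k whose p-value satisfies
        # p_(k) <= fdr * k / n, then keep the k smallest p-values.
        k = 0
        for rank, i in enumerate(ranked, start=1):
            if p_values[i] <= fdr * rank / n:
                k = rank
        chosen = ranked[:k]
    elif selector_type == "fwe":
        # Threshold scaled by 1 / numFeatures.
        chosen = [i for i in range(n) if p_values[i] < fwe / n]
    else:
        raise ValueError(f"unknown selector type: {selector_type}")
    return sorted(chosen)


p = [0.001, 0.2, 0.03, 0.04]  # made-up p-values for four features
print(select_features(p, "fpr", fpr=0.05))  # [0, 2, 3]
print(select_features(p, "fwe", fwe=0.05))  # [0]
```

With the same made-up p-values, fpr keeps every feature below 0.05, while fwe compares against 0.05 / 4 = 0.0125 and keeps only the first feature, which shows how much stricter the family-wise criterion is.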

Examples

>>> from pyspark.mllib.feature import ChiSqSelector
>>> from pyspark.mllib.linalg import SparseVector, DenseVector
>>> from pyspark.mllib.regression import LabeledPoint
>>> data = sc.parallelize([
...     LabeledPoint(0.0, SparseVector(3, {0: 8.0, 1: 7.0})),
...     LabeledPoint(1.0, SparseVector(3, {1: 9.0, 2: 6.0})),
...     LabeledPoint(1.0, [0.0, 9.0, 8.0]),
...     LabeledPoint(2.0, [7.0, 9.0, 5.0]),
...     LabeledPoint(2.0, [8.0, 7.0, 3.0])
... ])
>>> model = ChiSqSelector(numTopFeatures=1).fit(data)
>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
SparseVector(1, {})
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
>>> model = ChiSqSelector(selectorType="fpr", fpr=0.2).fit(data)
>>> model.transform(SparseVector(3, {1: 9.0, 2: 6.0}))
SparseVector(1, {})
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])
>>> model = ChiSqSelector(selectorType="percentile", percentile=0.34).fit(data)
>>> model.transform(DenseVector([7.0, 9.0, 5.0]))
DenseVector([7.0])

Methods

fit(data)

Fits the selector to the data and returns a ChiSqSelectorModel.

setFdr(fdr)

Sets the FDR threshold, in [0.0, 1.0], for feature selection by false discovery rate.

setFpr(fpr)

Sets the FPR threshold, in [0.0, 1.0], for feature selection by false positive rate.

setFwe(fwe)

Sets the FWE threshold, in [0.0, 1.0], for feature selection by family-wise error rate.

setNumTopFeatures(numTopFeatures)

Sets numTopFeatures for feature selection by number of top features.

setPercentile(percentile)

Sets the percentile, in [0.0, 1.0], for feature selection by percentile.

setSelectorType(selectorType)

Sets the selector type of the ChiSqSelector.

Methods Documentation

fit(data: pyspark.rdd.RDD[pyspark.mllib.regression.LabeledPoint]) → pyspark.mllib.feature.ChiSqSelectorModel

Fits the selector to the data and returns a ChiSqSelectorModel.

Parameters
data : pyspark.RDD of pyspark.mllib.regression.LabeledPoint

The labeled dataset with categorical features. Real-valued features are treated as categorical, with each distinct value counted as its own category, so apply a feature discretizer before calling this method.
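Because every distinct real value becomes its own category, continuous features usually need to be binned first. The following is a minimal sketch of one possible discretizer, assuming equal-width bins are acceptable for the data; the helper `discretize` and its `bin_width` parameter are hypothetical and are not part of PySpark.

```python
import math
from typing import List, Sequence


def discretize(features: Sequence[float], bin_width: float) -> List[float]:
    """Map each real value to the index of its equal-width bin."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    # Values in [k * bin_width, (k + 1) * bin_width) all map to category k.
    return [float(math.floor(x / bin_width)) for x in features]


print(discretize([0.3, 7.9, 15.2], bin_width=4.0))  # [0.0, 1.0, 3.0]
```

On an RDD of LabeledPoint, such a helper could be applied before fitting with something like `data.map(lambda lp: LabeledPoint(lp.label, discretize(lp.features.toArray(), 4.0)))`; the bin width of 4.0 is an arbitrary illustration.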

setFdr(fdr: float) → pyspark.mllib.feature.ChiSqSelector

Sets the FDR threshold, in [0.0, 1.0], for feature selection by false discovery rate. Only applicable when selectorType = “fdr”.

setFpr(fpr: float) → pyspark.mllib.feature.ChiSqSelector

Sets the FPR threshold, in [0.0, 1.0], for feature selection by false positive rate. Only applicable when selectorType = “fpr”.

setFwe(fwe: float) → pyspark.mllib.feature.ChiSqSelector

Sets the FWE threshold, in [0.0, 1.0], for feature selection by family-wise error rate. Only applicable when selectorType = “fwe”.

setNumTopFeatures(numTopFeatures: int) → pyspark.mllib.feature.ChiSqSelector

Sets numTopFeatures for feature selection by number of top features. Only applicable when selectorType = “numTopFeatures”.

setPercentile(percentile: float) → pyspark.mllib.feature.ChiSqSelector

Sets the percentile, in [0.0, 1.0], for feature selection by percentile. Only applicable when selectorType = “percentile”.

setSelectorType(selectorType: str) → pyspark.mllib.feature.ChiSqSelector

Sets the selector type of the ChiSqSelector. Supported options: “numTopFeatures” (default), “percentile”, “fpr”, “fdr”, “fwe”.