StandardScaler

class pyspark.mllib.feature.StandardScaler(withMean: bool = False, withStd: bool = True)

Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
Parameters
- withMean : bool, optional
  False by default. Centers the data with the mean before scaling. It will build a dense output, so take care when applying to sparse input.
- withStd : bool, optional
  True by default. Scales the data to unit standard deviation.
Examples
>>> vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]
>>> dataset = sc.parallelize(vs)
>>> standardizer = StandardScaler(True, True)
>>> model = standardizer.fit(dataset)
>>> result = model.transform(dataset)
>>> for r in result.collect(): r
DenseVector([-0.7071, 0.7071, -0.7071])
DenseVector([0.7071, -0.7071, 0.7071])
>>> int(model.std[0])
4
>>> int(model.mean[0]*10)
9
>>> model.withStd
True
>>> model.withMean
True
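For reference, the transform above amounts to the column-wise computation (x - mean) / std, where std is the sample standard deviation (n - 1 denominator). A rough NumPy sketch of that formula follows; it is an illustration only, not the MLlib implementation.

import numpy as np

X = np.array([[-2.0, 2.3, 0.0],
              [3.8, 0.0, 1.9]])
mean = X.mean(axis=0)        # [0.9, 1.15, 0.95]
std = X.std(axis=0, ddof=1)  # sample standard deviation, ~[4.10, 1.63, 1.34]
Z = (X - mean) / std         # rows ~[-0.7071, 0.7071, -0.7071] and [0.7071, -0.7071, 0.7071]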
Methods

fit(dataset)
Computes the mean and variance and stores as a model to be used for later scaling.
Methods Documentation

fit(dataset: pyspark.rdd.RDD[VectorLike]) → StandardScalerModel

Computes the mean and variance and stores as a model to be used for later scaling.

Parameters
- dataset : pyspark.RDD
  The data used to compute the mean and variance to build the transformation model.
Returns
- StandardScalerModel
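A minimal usage sketch of fit followed by later scaling of new data, assuming an existing SparkContext named sc; the vector values are illustrative only.

from pyspark.mllib.linalg import Vectors
from pyspark.mllib.feature import StandardScaler

# Fit the scaler on training data; this computes per-column mean and variance.
train = sc.parallelize([Vectors.dense([-2.0, 2.3, 0.0]),
                        Vectors.dense([3.8, 0.0, 1.9])])
model = StandardScaler(withMean=True, withStd=True).fit(train)

# Reuse the fitted model to standardize data seen later.
new_data = sc.parallelize([Vectors.dense([1.0, 1.0, 1.0])])
print(model.transform(new_data).collect())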