Summarizer¶

class pyspark.ml.stat.Summarizer¶

Tools for vectorized statistics on MLlib Vectors. The methods in this package provide various statistics for Vectors contained inside DataFrames. This class lets users pick the statistics they would like to extract for a given column.
Examples
>>> from pyspark.ml.stat import Summarizer
>>> from pyspark.sql import Row
>>> from pyspark.ml.linalg import Vectors
>>> summarizer = Summarizer.metrics("mean", "count")
>>> df = sc.parallelize([Row(weight=1.0, features=Vectors.dense(1.0, 1.0, 1.0)),
...                      Row(weight=0.0, features=Vectors.dense(1.0, 2.0, 3.0))]).toDF()
>>> df.select(summarizer.summary(df.features, df.weight)).show(truncate=False)
+-----------------------------------+
|aggregate_metrics(features, weight)|
+-----------------------------------+
|{[1.0,1.0,1.0], 1}                 |
+-----------------------------------+
>>> df.select(summarizer.summary(df.features)).show(truncate=False)
+--------------------------------+
|aggregate_metrics(features, 1.0)|
+--------------------------------+
|{[1.0,1.5,2.0], 2}              |
+--------------------------------+
>>> df.select(Summarizer.mean(df.features, df.weight)).show(truncate=False)
+--------------+
|mean(features)|
+--------------+
|[1.0,1.0,1.0] |
+--------------+
>>> df.select(Summarizer.mean(df.features)).show(truncate=False)
+--------------+
|mean(features)|
+--------------+
|[1.0,1.5,2.0] |
+--------------+
Methods

count(col[, weightCol])	return a column of count summary
max(col[, weightCol])	return a column of max summary
mean(col[, weightCol])	return a column of mean summary
metrics(*metrics)	Given a list of metrics, provides a builder that in turn computes metrics from a column.
min(col[, weightCol])	return a column of min summary
normL1(col[, weightCol])	return a column of normL1 summary
normL2(col[, weightCol])	return a column of normL2 summary
numNonZeros(col[, weightCol])	return a column of numNonZero summary
std(col[, weightCol])	return a column of std summary
sum(col[, weightCol])	return a column of sum summary
variance(col[, weightCol])	return a column of variance summary
Methods Documentation
static count(col: pyspark.sql.column.Column, weightCol: Optional[pyspark.sql.column.Column] = None) → pyspark.sql.column.Column¶

return a column of count summary

static max(col: pyspark.sql.column.Column, weightCol: Optional[pyspark.sql.column.Column] = None) → pyspark.sql.column.Column¶

return a column of max summary

static mean(col: pyspark.sql.column.Column, weightCol: Optional[pyspark.sql.column.Column] = None) → pyspark.sql.column.Column¶

return a column of mean summary
static metrics(*metrics: str) → pyspark.ml.stat.SummaryBuilder¶

Given a list of metrics, provides a builder that in turn computes metrics from a column.

See the documentation of Summarizer for an example.

The following metrics are accepted (case sensitive):

- mean: a vector that contains the coefficient-wise mean.
- sum: a vector that contains the coefficient-wise sum.
- variance: a vector that contains the coefficient-wise variance.
- std: a vector that contains the coefficient-wise standard deviation.
- count: the count of all vectors seen.
- numNonzeros: a vector with the number of non-zeros for each coefficient.
- max: the maximum for each coefficient.
- min: the minimum for each coefficient.
- normL2: the Euclidean norm for each coefficient.
- normL1: the L1 norm of each coefficient (sum of the absolute values).

Parameters
metrics : str
    metrics that can be provided.

Returns
pyspark.ml.stat.SummaryBuilder

Notes
Currently, the performance of this interface is about 2x~3x slower than using the RDD interface.
static min(col: pyspark.sql.column.Column, weightCol: Optional[pyspark.sql.column.Column] = None) → pyspark.sql.column.Column¶

return a column of min summary

static normL1(col: pyspark.sql.column.Column, weightCol: Optional[pyspark.sql.column.Column] = None) → pyspark.sql.column.Column¶

return a column of normL1 summary

static normL2(col: pyspark.sql.column.Column, weightCol: Optional[pyspark.sql.column.Column] = None) → pyspark.sql.column.Column¶

return a column of normL2 summary

static numNonZeros(col: pyspark.sql.column.Column, weightCol: Optional[pyspark.sql.column.Column] = None) → pyspark.sql.column.Column¶

return a column of numNonZero summary

static std(col: pyspark.sql.column.Column, weightCol: Optional[pyspark.sql.column.Column] = None) → pyspark.sql.column.Column¶

return a column of std summary

static sum(col: pyspark.sql.column.Column, weightCol: Optional[pyspark.sql.column.Column] = None) → pyspark.sql.column.Column¶

return a column of sum summary

static variance(col: pyspark.sql.column.Column, weightCol: Optional[pyspark.sql.column.Column] = None) → pyspark.sql.column.Column¶

return a column of variance summary