PowerIterationClustering

class pyspark.ml.clustering.PowerIterationClustering(*, k: int = 2, maxIter: int = 20, initMode: str = 'random', srcCol: str = 'src', dstCol: str = 'dst', weightCol: Optional[str] = None)

Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by Lin and Cohen. From the abstract: PIC finds a very low-dimensional embedding of a dataset using truncated power iteration on a normalized pair-wise similarity matrix of the data.

This class is not yet an Estimator/Transformer; use the assignClusters() method to run the PowerIterationClustering algorithm.
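
The idea behind the embedding can be illustrated without Spark. The following is a minimal pure-Python sketch, not the Spark implementation: it row-normalizes a small dense affinity matrix and applies a fixed number of power-iteration steps. The function name truncated_power_iteration, the degree-based start vector, the L1 normalization, and the fixed iteration count are assumptions made for illustration. A full PIC run would then cluster the resulting one-dimensional embedding, for example with k-means.

```python
def truncated_power_iteration(affinity, num_iter=10):
    """Return a 1-D embedding after num_iter power-iteration steps.

    affinity is a dense, symmetric, nonnegative matrix (list of lists)
    in which every row has a positive sum. Hypothetical helper, for
    illustration only.
    """
    n = len(affinity)
    # Row-normalize: W = D^-1 A, where D holds the row sums (degrees).
    degrees = [sum(row) for row in affinity]
    w = [[a / degrees[i] for a in row] for i, row in enumerate(affinity)]
    # Assumed start vector: degrees scaled to unit L1 norm.
    total = sum(degrees)
    v = [d / total for d in degrees]
    for _ in range(num_iter):
        # One step: v <- W v, rescaled to unit L1 norm.
        v_next = [sum(w[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = sum(abs(x) for x in v_next)
        v = [x / norm for x in v_next]
    return v


# Vertices 0-2 are strongly connected; vertex 3 is attached only weakly.
affinity = [
    [0.0, 0.9, 0.8, 0.1],
    [0.9, 0.0, 0.7, 0.0],
    [0.8, 0.7, 0.0, 0.0],
    [0.1, 0.0, 0.0, 0.0],
]
embedding = truncated_power_iteration(affinity, num_iter=5)
```

The truncation matters: if the iteration ran to convergence on a connected graph, the vector would approach the constant dominant eigenvector of the row-normalized matrix, which carries no cluster information.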

Notes

See Wikipedia on Spectral clustering

Examples

>>> data = [(1, 0, 0.5),
...         (2, 0, 0.5), (2, 1, 0.7),
...         (3, 0, 0.5), (3, 1, 0.7), (3, 2, 0.9),
...         (4, 0, 0.5), (4, 1, 0.7), (4, 2, 0.9), (4, 3, 1.1),
...         (5, 0, 0.5), (5, 1, 0.7), (5, 2, 0.9), (5, 3, 1.1), (5, 4, 1.3)]
>>> df = spark.createDataFrame(data).toDF("src", "dst", "weight").repartition(1)
>>> pic = PowerIterationClustering(k=2, weightCol="weight")
>>> pic.setMaxIter(40)
PowerIterationClustering...
>>> assignments = pic.assignClusters(df)
>>> assignments.sort(assignments.id).show(truncate=False)
+---+-------+
|id |cluster|
+---+-------+
|0  |0      |
|1  |0      |
|2  |0      |
|3  |0      |
|4  |0      |
|5  |1      |
+---+-------+
...
>>> pic_path = temp_path + "/pic"
>>> pic.save(pic_path)
>>> pic2 = PowerIterationClustering.load(pic_path)
>>> pic2.getK()
2
>>> pic2.getMaxIter()
40
>>> pic2.assignClusters(df).take(6) == assignments.take(6)
True

Methods

`assignClusters`(dataset)	Runs the PIC algorithm and returns a cluster assignment for each input vertex.
`clear`(param)	Clears a param from the param map if it has been explicitly set.
`copy`([extra])	Creates a copy of this instance with the same uid and some extra params.
`explainParam`(param)	Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.
`explainParams`()	Returns the documentation of all params with their optional default values and user-supplied values.
`extractParamMap`([extra])	Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.
`getDstCol`()	Gets the value of `dstCol` or its default value.
`getInitMode`()	Gets the value of `initMode` or its default value.
`getK`()	Gets the value of `k` or its default value.
`getMaxIter`()	Gets the value of `maxIter` or its default value.
`getOrDefault`(param)	Gets the value of a param in the user-supplied param map or its default value.
`getParam`(paramName)	Gets a param by its name.
`getSrcCol`()	Gets the value of `srcCol` or its default value.
`getWeightCol`()	Gets the value of `weightCol` or its default value.
`hasDefault`(param)	Checks whether a param has a default value.
`hasParam`(paramName)	Tests whether this instance contains a param with a given (string) name.
`isDefined`(param)	Checks whether a param is explicitly set by user or has a default value.
`isSet`(param)	Checks whether a param is explicitly set by user.
`load`(path)	Reads an ML instance from the input path, a shortcut of read().load(path).
`read`()	Returns an MLReader instance for this class.
`save`(path)	Saves this ML instance to the given path, a shortcut of write().save(path).
`set`(param, value)	Sets a parameter in the embedded param map.
`setDstCol`(value)	Sets the value of `dstCol`.
`setInitMode`(value)	Sets the value of `initMode`.
`setK`(value)	Sets the value of `k`.
`setMaxIter`(value)	Sets the value of `maxIter`.
`setParams`(self, \*[, k, maxIter, initMode, …])	Sets params for PowerIterationClustering.
`setSrcCol`(value)	Sets the value of `srcCol`.
`setWeightCol`(value)	Sets the value of `weightCol`.
`write`()	Returns an MLWriter instance for this ML instance.

Attributes

`dstCol`
`initMode`
`k`
`maxIter`
`params`	Returns all params ordered by name.
`srcCol`
`weightCol`

Methods Documentation

assignClusters(dataset: pyspark.sql.dataframe.DataFrame) → pyspark.sql.dataframe.DataFrame

Runs the PIC algorithm and returns a cluster assignment for each input vertex.

Parameters

dataset (pyspark.sql.DataFrame): A dataset with columns src, dst, weight representing the affinity matrix, which is the matrix A in the PIC paper. Suppose the src column value is i, the dst column value is j, and the weight column value is the similarity s_ij, which must be nonnegative. The matrix is symmetric, so s_ij = s_ji. For any (i, j) with nonzero similarity, the input should contain either (i, j, s_ij) or (j, i, s_ji). Rows with i = j are ignored, because the self-similarity s_ii is assumed to be 0.0.
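
As an illustration of the input contract above, the sketch below turns a small dense similarity matrix into (src, dst, weight) rows. The helper name to_edge_rows and the dense list-of-lists input are assumptions for this example and are not part of the PySpark API. It emits each unordered pair once, skips the diagonal, drops zero similarities, and rejects negative or asymmetric entries.

```python
def to_edge_rows(similarity):
    """Convert a symmetric similarity matrix to (src, dst, weight) rows.

    Hypothetical helper: walks the strictly lower triangle so that each
    pair (i, j) appears once and rows with i == j are never produced.
    """
    rows = []
    for i, row in enumerate(similarity):
        for j in range(i):
            s = row[j]
            if s < 0:
                raise ValueError(f"similarity[{i}][{j}] must be nonnegative")
            if s != similarity[j][i]:
                raise ValueError(f"matrix is not symmetric at ({i}, {j})")
            if s > 0:
                rows.append((i, j, float(s)))
    return rows


similarity = [
    [0.0, 0.5, 0.0],
    [0.5, 0.0, 0.7],
    [0.0, 0.7, 0.0],
]
rows = to_edge_rows(similarity)
# rows == [(1, 0, 0.5), (2, 1, 0.7)]
```

Rows in this shape could then be passed to spark.createDataFrame(rows, ["src", "dst", "weight"]), matching the column names used in the class example.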

Returns

pyspark.sql.DataFrame: A dataset that contains a column of vertex ids and the corresponding cluster for each id. Its schema is:
- id: Long
- cluster: Int

clear(param: pyspark.ml.param.Param) → None: Clears a param from the param map if it has been explicitly set.

copy(extra: Optional[ParamMap] = None) → JP

Creates a copy of this instance with the same uid and some extra params. This implementation first calls Params.copy and then makes a copy of the companion Java pipeline component with extra params, so both the Python wrapper and the Java pipeline component get copied.

Parameters

extra (dict, optional): Extra parameters to copy to the new instance

Returns

JavaParams: Copy of this instance

explainParam(param: Union[str, pyspark.ml.param.Param]) → str: Explains a single param and returns its name, doc, and optional default value and user-supplied value in a string.

explainParams() → str: Returns the documentation of all params with their optional default values and user-supplied values.

extractParamMap(extra: Optional[ParamMap] = None) → ParamMap

Extracts the embedded default param values and user-supplied values, and then merges them with extra values from input into a flat param map, where the latter value is used if there exist conflicts, i.e., with ordering: default param values < user-supplied values < extra.

Parameters

extra (dict, optional): extra param values

Returns

dict: merged param map
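
The precedence rule can be pictured with plain dictionaries. This is only an analogy under assumed names (merge_param_maps and string keys are hypothetical; the real param map is keyed by Param objects), but it shows the ordering default param values < user-supplied values < extra.

```python
def merge_param_maps(defaults, user_supplied, extra=None):
    """Merge three plain dicts so that later sources win on conflicts.

    Hypothetical stand-in for the merge order described above; it does
    not call into PySpark.
    """
    merged = dict(defaults)          # lowest priority
    merged.update(user_supplied)     # overrides defaults
    merged.update(extra or {})       # highest priority
    return merged


defaults = {"k": 2, "maxIter": 20}
user_supplied = {"maxIter": 40}
extra = {"k": 3}
merged = merge_param_maps(defaults, user_supplied, extra)
# merged == {"k": 3, "maxIter": 40}
```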

getDstCol() → str: Gets the value of dstCol or its default value.

getInitMode() → str: Gets the value of initMode or its default value.

getK() → int: Gets the value of k or its default value.

getMaxIter() → int: Gets the value of maxIter or its default value.

getOrDefault(param: Union[str, pyspark.ml.param.Param[T]]) → Union[Any, T]: Gets the value of a param in the user-supplied param map or its default value. Raises an error if neither is set.

getParam(paramName: str) → pyspark.ml.param.Param: Gets a param by its name.

getSrcCol() → str: Gets the value of srcCol or its default value.

getWeightCol() → str: Gets the value of weightCol or its default value.

hasDefault(param: Union[str, pyspark.ml.param.Param[Any]]) → bool: Checks whether a param has a default value.

hasParam(paramName: str) → bool: Tests whether this instance contains a param with a given (string) name.

isDefined(param: Union[str, pyspark.ml.param.Param[Any]]) → bool: Checks whether a param is explicitly set by user or has a default value.

isSet(param: Union[str, pyspark.ml.param.Param[Any]]) → bool: Checks whether a param is explicitly set by user.

classmethod load(path: str) → RL: Reads an ML instance from the input path, a shortcut of read().load(path).

classmethod read() → pyspark.ml.util.JavaMLReader[RL]: Returns an MLReader instance for this class.

save(path: str) → None: Saves this ML instance to the given path, a shortcut of write().save(path).

set(param: pyspark.ml.param.Param, value: Any) → None: Sets a parameter in the embedded param map.

setDstCol(value: str) → pyspark.ml.clustering.PowerIterationClustering: Sets the value of dstCol.

setInitMode(value: str) → pyspark.ml.clustering.PowerIterationClustering: Sets the value of initMode.

setK(value: int) → pyspark.ml.clustering.PowerIterationClustering: Sets the value of k.

setMaxIter(value: int) → pyspark.ml.clustering.PowerIterationClustering: Sets the value of maxIter.

setParams(self, \*, k=2, maxIter=20, initMode="random", srcCol="src", dstCol="dst", weightCol=None): Sets params for PowerIterationClustering.

setSrcCol(value: str) → pyspark.ml.clustering.PowerIterationClustering: Sets the value of srcCol.

setWeightCol(value: str) → pyspark.ml.clustering.PowerIterationClustering: Sets the value of weightCol.

write() → pyspark.ml.util.JavaMLWriter: Returns an MLWriter instance for this ML instance.

Attributes Documentation

dstCol = Param(parent='undefined', name='dstCol', doc='Name of the input column for destination vertex IDs.')

initMode = Param(parent='undefined', name='initMode', doc="The initialization algorithm. This can be either 'random' to use a random vector as vertex properties, or 'degree' to use a normalized sum of similarities with other vertices. Supported options: 'random' and 'degree'.")
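
The two initMode options can be pictured with a small pure-Python sketch. It is an approximation of the behavior described in the param doc, not the Spark code: the helper name init_vector, the Gaussian draws, the seed argument, and the scaling to unit L1 norm are all assumptions made for illustration.

```python
import random


def init_vector(affinity, init_mode="random", seed=0):
    """Build a start vector for power iteration (hypothetical helper).

    'degree' uses each vertex's sum of similarities with other vertices;
    'random' uses random values. Both are scaled to unit L1 norm here,
    which is an assumption of this sketch.
    """
    n = len(affinity)
    if init_mode == "degree":
        raw = [sum(row) for row in affinity]
    elif init_mode == "random":
        rng = random.Random(seed)
        raw = [rng.gauss(0.0, 1.0) for _ in range(n)]
    else:
        raise ValueError(f"unsupported init_mode: {init_mode!r}")
    norm = sum(abs(x) for x in raw)
    return [x / norm for x in raw]


star = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
]
degree_start = init_vector(star, "degree")
# degree_start == [0.5, 0.25, 0.25]
random_start = init_vector(star, "random", seed=42)
```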

k = Param(parent='undefined', name='k', doc='The number of clusters to create. Must be > 1.')

maxIter = Param(parent='undefined', name='maxIter', doc='max number of iterations (>= 0).')

params: Returns all params ordered by name. The default implementation uses dir() to get all attributes of type Param.

srcCol = Param(parent='undefined', name='srcCol', doc='Name of the input column for source vertex IDs.')

weightCol = Param(parent='undefined', name='weightCol', doc='weight column name. If this is not set or empty, we treat all instance weights as 1.0.')
