KMeansModel

class pyspark.mllib.clustering.KMeansModel(centers: List[VectorLike])

A clustering model derived from the k-means method.

Examples

>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
>>> model = KMeans.train(
...     sc.parallelize(data), 2, maxIterations=10, initializationMode="random",
...                    seed=50, initializationSteps=5, epsilon=1e-4)
>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
True
>>> model.predict(array([8.0, 9.0])) == model.predict(array([9.0, 8.0]))
True
>>> model.k
2
>>> model.computeCost(sc.parallelize(data))
2.0
>>> model = KMeans.train(sc.parallelize(data), 2)
>>> sparse_data = [
...     SparseVector(3, {1: 1.0}),
...     SparseVector(3, {1: 1.1}),
...     SparseVector(3, {2: 1.0}),
...     SparseVector(3, {2: 1.1})
... ]
>>> model = KMeans.train(sc.parallelize(sparse_data), 2, initializationMode="k-means||",
...                                     seed=50, initializationSteps=5, epsilon=1e-4)
>>> model.predict(array([0., 1., 0.])) == model.predict(array([0, 1.1, 0.]))
True
>>> model.predict(array([0., 0., 1.])) == model.predict(array([0, 0, 1.1]))
True
>>> model.predict(sparse_data[0]) == model.predict(sparse_data[1])
True
>>> model.predict(sparse_data[2]) == model.predict(sparse_data[3])
True
>>> isinstance(model.clusterCenters, list)
True
>>> import os, tempfile
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = KMeansModel.load(sc, path)
>>> sameModel.predict(sparse_data[0]) == model.predict(sparse_data[0])
True
>>> from shutil import rmtree
>>> try:
...     rmtree(path)
... except OSError:
...     pass
>>> data = array([-383.1,-382.9, 28.7,31.2, 366.2,367.3]).reshape(3, 2)
>>> model = KMeans.train(sc.parallelize(data), 3, maxIterations=0,
...     initialModel = KMeansModel([(-1000.0,-1000.0),(5.0,5.0),(1000.0,1000.0)]))
>>> model.clusterCenters
[array([-1000., -1000.]), array([ 5.,  5.]), array([ 1000.,  1000.])]

Methods

computeCost(rdd)

Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.

load(sc, path)

Load a model from the given path.

predict(x)

Find the cluster that each of the points belongs to in this model.

save(sc, path)

Save this model to the given path.

Attributes

clusterCenters

Get the cluster centers, represented as a list of NumPy arrays.

k

Total number of clusters.

Methods Documentation

computeCost(rdd: pyspark.rdd.RDD[VectorLike]) → float

Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.

Parameters
rdd:pyspark.RDD

The RDD of points to compute the cost on.

classmethod load(sc: pyspark.context.SparkContext, path: str)pyspark.mllib.clustering.KMeansModel

Load a model from the given path.

predict(x: Union[VectorLike, pyspark.rdd.RDD[VectorLike]]) → Union[int, pyspark.rdd.RDD[int]]

Find the cluster that each of the points belongs to in this model.

Parameters
xpyspark.mllib.linalg.Vector or pyspark.RDD

A data point (or RDD of points) to determine cluster index. pyspark.mllib.linalg.Vector can be replaced with equivalent objects (list, tuple, numpy.ndarray).

Returns
int or pyspark.RDD of int

Predicted cluster index or an RDD of predicted cluster indices if the input is an RDD.

save(sc: pyspark.context.SparkContext, path: str) → None

Save this model to the given path.

Attributes Documentation

clusterCenters

Get the cluster centers, represented as a list of NumPy arrays.

k

Total number of clusters.