LDAModel

class pyspark.mllib.clustering.LDAModel(java_model: py4j.java_gateway.JavaObject)

A clustering model derived from the LDA method.

Latent Dirichlet Allocation (LDA) is a topic model designed for text documents.

Terminology:

  • “word” = “term”: an element of the vocabulary

  • “token”: instance of a term appearing in a document

  • “topic”: multinomial distribution over words representing some concept
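To make the terminology concrete, the following plain-Python sketch (not part of the PySpark API) counts the tokens of one document per term and lays the counts out as a vector indexed by vocabulary position. The vocabulary, the document text, and the helper name `term_count_vector` are illustrative assumptions.

```python
from collections import Counter

# Hypothetical vocabulary: each distinct word is a "term" with a fixed index.
vocab = ["spark", "topic", "model"]

def term_count_vector(tokens, vocab):
    """Return per-term token counts, ordered by vocabulary index."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in vocab]

# Each occurrence of a word in the document is a "token".
tokens = "spark topic spark model spark".split()
vector = term_count_vector(tokens, vocab)
print(vector)  # [3.0, 1.0, 1.0]
```

Here the document contains five tokens but only three distinct terms; a vector of this form could be wrapped with Vectors.dense to build one row of the training data.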

Notes

See the original LDA paper (journal version) [1].

[1] Blei, D. et al. “Latent Dirichlet Allocation.” J. Mach. Learn. Res. 3 (2003): 993-1022. https://www.jmlr.org/papers/v3/blei03a

Examples

>>> from numpy import array
>>> from numpy.testing import assert_almost_equal, assert_equal
>>> from pyspark.mllib.clustering import LDA, LDAModel
>>> from pyspark.mllib.linalg import SparseVector, Vectors
>>> data = [
...     [1, Vectors.dense([0.0, 1.0])],
...     [2, SparseVector(2, {0: 1.0})],
... ]
>>> # sc is an existing SparkContext.
>>> rdd = sc.parallelize(data)
>>> model = LDA.train(rdd, k=2, seed=1)
>>> model.vocabSize()
2
>>> model.describeTopics()
[([1, 0], [0.5..., 0.49...]), ([0, 1], [0.5..., 0.49...])]
>>> model.describeTopics(1)
[([1], [0.5...]), ([0], [0.5...])]
>>> topics = model.topicsMatrix()
>>> topics_expect = array([[0.5,  0.5], [0.5, 0.5]])
>>> assert_almost_equal(topics, topics_expect, 1)
>>> import os, tempfile
>>> from shutil import rmtree
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = LDAModel.load(sc, path)
>>> assert_equal(sameModel.topicsMatrix(), model.topicsMatrix())
>>> sameModel.vocabSize() == model.vocabSize()
True
>>> try:
...     rmtree(path)
... except OSError:
...     pass

Methods

call(name, *a)

Call a method of java_model.

describeTopics([maxTermsPerTopic])

Return the topics described by weighted terms.

load(sc, path)

Load the LDAModel from disk.

save(sc, path)

Save this model to the given path.

topicsMatrix()

Inferred topics, where each topic is represented by a distribution over terms.

vocabSize()

Vocabulary size (number of terms, or words, in the vocabulary).

Methods Documentation

call(name: str, *a: Any) → Any

Call a method of java_model.

describeTopics(maxTermsPerTopic: Optional[int] = None) → List[Tuple[List[int], List[float]]]

Return the topics described by weighted terms.

Warning

If vocabSize and k are large, this can return a large object!

Parameters
maxTermsPerTopic : int, optional

Maximum number of terms to collect for each topic. (default: vocabulary size)

Returns
list

Array over topics. Each topic is represented as a pair of matching arrays: (term indices, term weights in topic). Each topic’s terms are sorted in order of decreasing weight.
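As a hedged illustration of working with this return value, the sketch below pairs each term index with a word and its weight. The helper name `label_topics`, the `described` values, and the index-to-word list `vocab` are hypothetical; LDAModel itself works only with term indices, so the mapping to words must come from whatever built the input vectors.

```python
def label_topics(described, vocab):
    """Pair each term index with its word and weight, topic by topic."""
    return [
        [(vocab[i], w) for i, w in zip(indices, weights)]
        for indices, weights in described
    ]

# Hypothetical values shaped like the return value of describeTopics().
described = [([1, 0], [0.6, 0.4]), ([0, 1], [0.7, 0.3])]
vocab = ["apple", "banana"]  # hypothetical index-to-word mapping

labeled = label_topics(described, vocab)
for topic_id, terms in enumerate(labeled):
    print(topic_id, terms)
```

Because each topic's terms arrive sorted by decreasing weight, the first pair in each inner list is that topic's heaviest term.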

classmethod load(sc: pyspark.context.SparkContext, path: str) → pyspark.mllib.clustering.LDAModel

Load the LDAModel from disk.

Parameters
sc : pyspark.SparkContext
path : str

Path to where the model is stored.

save(sc: pyspark.context.SparkContext, path: str) → None

Save this model to the given path.

topicsMatrix() → numpy.ndarray

Inferred topics, where each topic is represented by a distribution over terms.
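The following sketch shows one way to read per-topic distributions out of such a matrix. It assumes the layout described in the Scala API documentation (vocabSize rows by k columns, one column per topic) and assumes the entries may be unnormalized weights, so each column is rescaled to sum to 1. The helper name `topic_columns` and the sample values are hypothetical; it operates on nested lists such as those produced by calling `.tolist()` on the returned array.

```python
def topic_columns(matrix_rows):
    """Split a vocabSize x k matrix (list of rows) into k normalized columns."""
    k = len(matrix_rows[0])
    columns = [[row[j] for row in matrix_rows] for j in range(k)]
    normalized = []
    for col in columns:
        total = sum(col)  # assumed positive for every topic
        normalized.append([w / total for w in col])
    return normalized

# Hypothetical 3-term, 2-topic matrix, e.g. model.topicsMatrix().tolist().
rows = [[2.0, 1.0], [1.0, 1.0], [1.0, 2.0]]
topics = topic_columns(rows)
print(topics[0])  # [0.5, 0.25, 0.25]
print(topics[1])  # [0.25, 0.25, 0.5]
```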

vocabSize() → int

Vocabulary size (number of terms, or words, in the vocabulary).