LDAModel¶

class pyspark.mllib.clustering.LDAModel(java_model: py4j.java_gateway.JavaObject)¶

A clustering model derived from the LDA method.
Latent Dirichlet Allocation (LDA) is a topic model designed for text documents.

Terminology

- "word" = "term": an element of the vocabulary
- "token": an instance of a term appearing in a document
- "topic": a multinomial distribution over words representing some concept
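The terminology above can be made concrete with a small numpy sketch: a "topic" is just a probability vector over the vocabulary, and "tokens" are term instances drawn from it. The toy vocabulary and weights below are illustrative, not taken from any trained model.

```python
import numpy as np

# Toy vocabulary; each "word"/"term" is one element of it.
vocab = ["spark", "cluster", "topic", "word"]

# A "topic": a multinomial distribution over the vocabulary.
# Weights are non-negative and sum to 1 (illustrative values).
topic = np.array([0.4, 0.3, 0.2, 0.1])  # P(word | topic)
assert np.isclose(topic.sum(), 1.0)

# "Tokens" are instances of terms: sampling from the topic yields
# the token occurrences of a hypothetical document.
rng = np.random.default_rng(0)
tokens = rng.choice(vocab, size=5, p=topic)
```

Every token drawn this way is a term from the vocabulary; a term may appear as many tokens, which is exactly the word/token distinction above.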
Notes

See the original LDA paper (journal version) [1].

[1] Blei, D. et al. "Latent Dirichlet Allocation." J. Mach. Learn. Res. 3 (2003): 993-1022. https://www.jmlr.org/papers/v3/blei03a
Examples
>>> from pyspark.mllib.linalg import Vectors, SparseVector
>>> from numpy.testing import assert_almost_equal, assert_equal
>>> data = [
...     [1, Vectors.dense([0.0, 1.0])],
...     [2, SparseVector(2, {0: 1.0})],
... ]
>>> rdd = sc.parallelize(data)
>>> model = LDA.train(rdd, k=2, seed=1)
>>> model.vocabSize()
2
>>> model.describeTopics()
[([1, 0], [0.5..., 0.49...]), ([0, 1], [0.5..., 0.49...])]
>>> model.describeTopics(1)
[([1], [0.5...]), ([0], [0.5...])]
>>> from numpy import array
>>> topics = model.topicsMatrix()
>>> topics_expect = array([[0.5, 0.5], [0.5, 0.5]])
>>> assert_almost_equal(topics, topics_expect, 1)
>>> import os, tempfile
>>> from shutil import rmtree
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = LDAModel.load(sc, path)
>>> assert_equal(sameModel.topicsMatrix(), model.topicsMatrix())
>>> sameModel.vocabSize() == model.vocabSize()
True
>>> try:
...     rmtree(path)
... except OSError:
...     pass
Methods

call(name, *a)
    Call method of java_model.
describeTopics([maxTermsPerTopic])
    Return the topics described by weighted terms.
load(sc, path)
    Load the LDAModel from disk.
save(sc, path)
    Save this model to the given path.
topicsMatrix()
    Inferred topics, where each topic is represented by a distribution over terms.
vocabSize()
    Vocabulary size (number of terms in the vocabulary).
Methods Documentation
call(name: str, *a: Any) → Any¶

Call method of java_model.
describeTopics(maxTermsPerTopic: Optional[int] = None) → List[Tuple[List[int], List[float]]]¶

Return the topics described by weighted terms.
Warning
If vocabSize and k are large, this can return a large object!
Parameters
    maxTermsPerTopic : int, optional
        Maximum number of terms to collect for each topic. (default: vocabulary size)

Returns
    list
        Array over topics. Each topic is represented as a pair of matching arrays: (term indices, term weights in topic). Each topic's terms are sorted in order of decreasing weight.
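The returned pairs can be joined with a caller-maintained index-to-term mapping to print readable topic summaries. The sketch below uses a hypothetical `describeTopics()` result and vocabulary, mirroring the shape of the doctest output above; the values are illustrative only.

```python
# Hypothetical describeTopics() output: one (term indices, term weights)
# pair per topic, with weights sorted in decreasing order.
topics = [
    ([1, 0], [0.51, 0.49]),
    ([0, 1], [0.52, 0.48]),
]

# Index -> term mapping; MLlib works with indices, so the caller keeps
# the vocabulary that was used to vectorize the documents.
vocab = ["alpha", "beta"]

for i, (indices, weights) in enumerate(topics):
    # Weights arrive sorted, highest first.
    assert all(w1 >= w2 for w1, w2 in zip(weights, weights[1:]))
    top_terms = [vocab[j] for j in indices]
    print(f"topic {i}: {list(zip(top_terms, weights))}")
```

Passing a small `maxTermsPerTopic` keeps these pairs short, which is the practical way around the large-object warning above when `vocabSize` is big.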
classmethod load(sc: pyspark.context.SparkContext, path: str) → pyspark.mllib.clustering.LDAModel¶

Load the LDAModel from disk.

Parameters
    sc : pyspark.SparkContext
    path : str
        Path to where the model is stored.
save(sc: pyspark.context.SparkContext, path: str) → None¶

Save this model to the given path.
topicsMatrix() → numpy.ndarray¶

Inferred topics, where each topic is represented by a distribution over terms.
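The matrix is laid out with one row per vocabulary term and one column per topic, so column j holds topic j's term weights. A minimal numpy sketch, assuming a toy (vocabSize x k) matrix in place of a real model's output: if the column weights are unnormalized (as when the model was fit with the EM optimizer, whose matrix holds expected counts), dividing each column by its sum recovers per-topic term distributions.

```python
import numpy as np

# Stand-in for model.topicsMatrix(): vocabSize=2 rows, k=2 topic
# columns. The values are illustrative, not from a trained model.
topics = np.array([[2.0, 6.0],
                   [8.0, 4.0]])

# Normalize each column so every topic is a proper distribution
# over terms (each column sums to 1).
term_dist = topics / topics.sum(axis=0, keepdims=True)
assert np.allclose(term_dist.sum(axis=0), 1.0)
```

For already-normalized output (as in the doctest above, where columns sum to 1) this normalization is a no-op.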
vocabSize() → int¶

Vocabulary size (number of terms in the vocabulary).