Word2Vec

class pyspark.mllib.feature.Word2Vec

Word2Vec creates vector representations of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns a vector representation for each word in the vocabulary. These vectors can be used as features in natural language processing and machine learning algorithms.

The implementation uses the skip-gram model and trains it with the hierarchical softmax method. Variable names in the implementation match those in the original C implementation.

For the original C implementation, see https://code.google.com/p/word2vec/. For the research papers, see "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality".

Examples

>>> sentence = "a b " * 100 + "a c " * 10
>>> localDoc = [sentence, sentence]
>>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
>>> model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc)

Querying for synonyms of a word will not return that word:

>>> syms = model.findSynonyms("a", 2)
>>> [s[0] for s in syms]
['b', 'c']

But querying for synonyms of a vector may return the word whose representation is that vector:

>>> vec = model.transform("a")
>>> syms = model.findSynonyms(vec, 2)
>>> [s[0] for s in syms]
['a', 'b']
>>> import os, tempfile
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = Word2VecModel.load(sc, path)
>>> model.transform("a") == sameModel.transform("a")
True
>>> syms = sameModel.findSynonyms("a", 2)
>>> [s[0] for s in syms]
['b', 'c']
>>> from shutil import rmtree
>>> try:
...     rmtree(path)
... except OSError:
...     pass
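
The difference between the two synonym queries above can be illustrated without Spark. The following is a minimal sketch, assuming neighbours are ranked by cosine similarity; the helper names (`cosine`, `nearest`) and the toy vectors are hypothetical and are not part of the PySpark API or the output of a trained model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(vectors, query, num, exclude=None):
    """Return the `num` (word, similarity) pairs closest to `query`.

    `exclude` mimics a query by word, where the query word itself is dropped.
    """
    scored = [
        (word, cosine(vec, query))
        for word, vec in vectors.items()
        if word != exclude
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]


# Hypothetical toy vectors chosen by hand for illustration.
toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}

# Query by word: the word "a" is excluded from its own neighbours.
by_word = nearest(toy_vectors, toy_vectors["a"], 2, exclude="a")

# Query by vector: nothing is excluded, so "a" matches itself first.
by_vector = nearest(toy_vectors, toy_vectors["a"], 2)
```

With these toy vectors, the query by word yields `b` and `c`, while the query by vector yields `a` and `b`, mirroring the pattern in the doctest above.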

Methods

fit(data)

Computes the vector representation of each word in vocabulary.

setLearningRate(learningRate)

Sets initial learning rate (default: 0.025).

setMinCount(minCount)

Sets minCount, the minimum number of times a token must appear to be included in the word2vec model’s vocabulary (default: 5).

setNumIterations(numIterations)

Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.

setNumPartitions(numPartitions)

Sets number of partitions (default: 1).

setSeed(seed)

Sets random seed.

setVectorSize(vectorSize)

Sets vector size (default: 100).

setWindowSize(windowSize)

Sets window size (default: 5).

Methods Documentation

fit(data: pyspark.rdd.RDD[List[str]]) → pyspark.mllib.feature.Word2VecModel

Computes the vector representation of each word in vocabulary.

Parameters
data : pyspark.RDD

Training data: an RDD of lists of strings.

Returns
Word2VecModel

setLearningRate(learningRate: float) → pyspark.mllib.feature.Word2Vec

Sets initial learning rate (default: 0.025).
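
The word "initial" indicates that the rate changes during training. As a rough illustration, the reference C implementation decays the rate linearly with the number of words processed and clamps it at a small fraction of the starting value. The sketch below follows that schedule; the function name and its parameters are hypothetical, and whether this implementation uses exactly the same schedule is an assumption.

```python
def decayed_learning_rate(initial_rate, words_processed, total_words,
                          floor_fraction=0.0001):
    """Linearly decay the learning rate, never dropping below a floor.

    Assumption: this mirrors the schedule of the reference C implementation;
    it is not taken from the PySpark API.
    """
    rate = initial_rate * (1.0 - words_processed / (total_words + 1.0))
    return max(rate, initial_rate * floor_fraction)


# Rates at a few points of a hypothetical 1000-word training run.
rates = [decayed_learning_rate(0.025, n, 1000) for n in (0, 250, 500, 750, 1000)]
```

The rate starts at the configured value and shrinks as training progresses, so a larger initial value mainly affects the early updates.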

setMinCount(minCount: int) → pyspark.mllib.feature.Word2Vec

Sets minCount, the minimum number of times a token must appear to be included in the word2vec model’s vocabulary (default: 5).
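
The effect of this threshold can be sketched in plain Python. The example below is only an illustration of the filtering rule described above; `build_vocab` is a hypothetical helper, not a PySpark function.

```python
from collections import Counter


def build_vocab(sentences, min_count=5):
    """Count tokens and keep those that appear at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}


# Same toy corpus as in the Examples section, split on whitespace.
sentence = ("a b " * 100 + "a c " * 10).split()
vocab = build_vocab([sentence], min_count=5)

# A token that appears only 4 times falls below the default threshold.
rare = build_vocab([sentence + ["d"] * 4], min_count=5)
```

Here `a`, `b`, and `c` all survive the default threshold, while the token `d` is dropped and would therefore have no vector in the trained model.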

setNumIterations(numIterations: int) → pyspark.mllib.feature.Word2Vec

Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.

setNumPartitions(numPartitions: int) → pyspark.mllib.feature.Word2Vec

Sets number of partitions (default: 1). Use a small number for accuracy.

setSeed(seed: int) → pyspark.mllib.feature.Word2Vec

Sets random seed.

setVectorSize(vectorSize: int) → pyspark.mllib.feature.Word2Vec

Sets vector size (default: 100).

setWindowSize(windowSize: int) → pyspark.mllib.feature.Word2Vec

Sets window size (default: 5).
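
In a skip-gram model, the window size bounds how far from a center word its context words may lie. The following sketch enumerates (center, context) training pairs for a fixed window; `skipgram_pairs` is a hypothetical helper, and the fixed window is a simplification (the reference C implementation, for instance, samples a smaller effective window at each position).

```python
def skipgram_pairs(tokens, window_size=5):
    """Return (center, context) pairs within `window_size` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window_size)
        hi = min(len(tokens), i + window_size + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "quick", "brown", "fox"], window_size=1)
```

A larger window produces more pairs per center word and tends to capture broader topical context, at the cost of more training work.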