HashingTF

class pyspark.mllib.feature.HashingTF(numFeatures: int = 1048576)

Maps a sequence of terms to their term frequencies using the hashing trick.

Parameters
numFeatures : int, optional

Number of features, i.e. the dimension of the output vectors (default: 2^20 = 1048576).
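To illustrate the hashing trick described above, here is a minimal pure-Python sketch. It assumes each term's index is a hash of the term reduced modulo the number of features. The function name `hashed_term_frequencies` is hypothetical, and `zlib.crc32` is a stand-in hash chosen only to make the sketch deterministic; it is not claimed to be the hash HashingTF uses, so the indices will generally differ from those HashingTF produces.

```python
import zlib
from collections import Counter


def hashed_term_frequencies(terms, num_features=1 << 20):
    """Return {index: count} for a sequence of string terms (illustrative only)."""
    counts = Counter()
    for term in terms:
        # Stand-in hash (assumption): CRC32 of the UTF-8 bytes, reduced modulo
        # the number of features to obtain a bucket index.
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1.0
    return dict(counts)


doc = "a a b b c d".split(" ")
tf = hashed_term_frequencies(doc, num_features=100)
```

Every index in `tf` lies in the range 0 to 99, and the counts sum to the number of terms in the document. Distinct terms that hash to the same index share a count, which is why the result may have fewer entries than the document has distinct terms.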

Notes

The terms must be hashable (cannot be dict, set, list, …).
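One way to satisfy the hashability requirement, sketched below, is to convert unhashable terms such as lists into tuples before building the document. The helper `to_hashable_terms` and the bigram data are hypothetical and are not part of the HashingTF API.

```python
def to_hashable_terms(raw_terms):
    """Convert list terms to tuples so they can be hashed (illustrative helper)."""
    return [tuple(t) if isinstance(t, list) else t for t in raw_terms]


# Hypothetical document whose terms are bigrams stored as lists.
bigrams = [["new", "york"], ["new", "york"], ["san", "jose"]]

# A list cannot be hashed, so it cannot be used as a term directly.
try:
    hash(bigrams[0])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# After conversion, each term is a tuple of strings and is hashable.
terms = to_hashable_terms(bigrams)
```

Equal tuples hash to the same value, so repeated bigrams are counted as the same term.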

Examples

>>> htf = HashingTF(100)
>>> doc = "a a b b c d".split(" ")
>>> htf.transform(doc)
SparseVector(100, {...})

Methods

indexOf(term)

Returns the index of the input term.

setBinary(value)

If True, the term frequency vector is binary: every non-zero term count is set to 1 (default: False).

transform(document)

Transforms an input document (a list of terms) into a term frequency vector, or transforms an RDD of documents into an RDD of term frequency vectors.

Methods Documentation

indexOf(term: Hashable) → int

Returns the index of the input term.
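Because indices are confined to a fixed range, distinct terms can share an index. The sketch below shows this with a deliberately tiny feature space. The function `index_of` is a hypothetical stand-in built on `zlib.crc32`, chosen for determinism; it is not claimed to be the hash behind `indexOf`.

```python
import zlib


def index_of(term, num_features):
    """Map a string term to a bucket index (stand-in hash, for illustration)."""
    return zlib.crc32(term.encode("utf-8")) % num_features


# Five distinct terms but only four buckets: by the pigeonhole principle,
# at least two terms must land on the same index.
terms = ["a", "b", "c", "d", "e"]
indices = [index_of(t, 4) for t in terms]
has_collision = len(set(indices)) < len(terms)
```

A larger feature space makes such collisions less likely, at the cost of higher-dimensional vectors.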

setBinary(value: bool) → pyspark.mllib.feature.HashingTF

If True, the term frequency vector is binary: every non-zero term count is set to 1 (default: False).
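The effect of the binary setting can be sketched in plain Python as follows. The function `binarize` is hypothetical, and the indices in `counts` are made up for illustration rather than taken from real HashingTF output.

```python
def binarize(term_counts):
    """Set every non-zero count to 1.0, keeping the same indices."""
    return {index: 1.0 for index, count in term_counts.items() if count != 0}


# Hypothetical index-to-count mapping for a six-term document.
counts = {3: 2.0, 17: 2.0, 42: 1.0, 88: 1.0}
binary = binarize(counts)
```

The binary form records only whether a term occurred, not how often it occurred.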

transform(document: Union[Iterable[Hashable], pyspark.rdd.RDD[Iterable[Hashable]]]) → Union[pyspark.mllib.linalg.Vector, pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]]

Transforms an input document (a list of terms) into a term frequency vector, or transforms an RDD of documents into an RDD of term frequency vectors.