HashingTF
class pyspark.mllib.feature.HashingTF(numFeatures: int = 1048576)
Maps a sequence of terms to their term frequencies using the hashing trick.
Parameters
numFeatures : int, optional
    Number of features (default: 2^20 = 1048576).
Notes
The terms must be hashable (a term cannot be a dict, set, list…).
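To illustrate why terms must be hashable, here is a minimal pure-Python sketch of the hashing trick. It assumes a term's index is its built-in hash() value modulo the number of features. The helper names term_index and term_frequencies are hypothetical and are not part of pyspark.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def term_index(term: Hashable, num_features: int) -> int:
    """Hypothetical helper: map a term to a bucket in [0, num_features)."""
    # hash() raises TypeError for unhashable terms such as list, set, or dict.
    # With a positive modulus, Python's % always yields a non-negative result.
    return hash(term) % num_features


def term_frequencies(document: Iterable[Hashable], num_features: int) -> Dict[int, float]:
    """Hypothetical helper: count how many terms fall into each bucket."""
    counts = Counter(term_index(term, num_features) for term in document)
    return {index: float(count) for index, count in counts.items()}


doc = "a a b b c d".split(" ")
tf = term_frequencies(doc, 100)  # maps bucket index -> count; counts sum to 6.0
```

Python randomizes str hashes per process unless PYTHONHASHSEED is set, so the bucket indices for string terms in this sketch can differ between runs. Terms that collide in the same bucket have their counts added together.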
Examples
>>> from pyspark.mllib.feature import HashingTF
>>> htf = HashingTF(100)
>>> doc = "a a b b c d".split(" ")
>>> htf.transform(doc)
SparseVector(100, {...})
Methods
indexOf(term)          Returns the index of the input term.
setBinary(value)       If True, the term frequency vector is binary: every non-zero term count is set to 1 (default: False).
transform(document)    Transforms an input document (a list of terms) into a term frequency vector, or an RDD of documents into an RDD of term frequency vectors.
Methods Documentation
indexOf(term: Hashable) → int
Returns the index of the input term.
setBinary(value: bool) → pyspark.mllib.feature.HashingTF
If True, the term frequency vector is binary: every non-zero term count is set to 1 (default: False).
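The effect of binary mode can be sketched in plain Python. This is an illustration under stated assumptions, not the pyspark implementation: the hypothetical helper hashed_counts buckets each term with hash() modulo the number of features and, when binary is true, records presence instead of a count.

```python
from typing import Dict, Hashable, Iterable


def hashed_counts(
    document: Iterable[Hashable], num_features: int, binary: bool = False
) -> Dict[int, float]:
    """Hypothetical helper: bucket terms, then count them or flag their presence."""
    freq: Dict[int, float] = {}
    for term in document:
        index = hash(term) % num_features
        # Binary mode pins every occupied bucket to 1.0; otherwise accumulate.
        freq[index] = 1.0 if binary else freq.get(index, 0.0) + 1.0
    return freq


# Small non-negative ints hash to themselves, so the buckets are predictable.
doc = [3, 3, 3, 7]
counts = hashed_counts(doc, 10)              # {3: 3.0, 7: 1.0}
flags = hashed_counts(doc, 10, binary=True)  # {3: 1.0, 7: 1.0}
```

Binary vectors suit models that need only the presence or absence of a term rather than how often it occurs.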
transform(document: Union[Iterable[Hashable], pyspark.rdd.RDD[Iterable[Hashable]]]) → Union[pyspark.mllib.linalg.Vector, pyspark.rdd.RDD[pyspark.mllib.linalg.Vector]]
Transforms an input document (a list of terms) into a term frequency vector, or an RDD of documents into an RDD of term frequency vectors.