pyspark.RDD.countApproxDistinct

RDD.countApproxDistinct(relativeSD: float = 0.05) → int

Return the approximate number of distinct elements in the RDD.

Parameters

relativeSD : float, optional
    Relative accuracy. Smaller values create counters that require more space. It must be greater than 0.000017.
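To make the space/accuracy trade-off concrete, here is a minimal sketch that estimates how many registers a HyperLogLog counter needs for a given relative standard deviation. It assumes the textbook HyperLogLog error bound of roughly 1.04 / sqrt(m) for m registers; the helper name `hll_size_for` is hypothetical and is not part of PySpark, and Spark's own sizing rule may differ in its constants.

```python
import math


def hll_size_for(relative_sd):
    """Return (p, m): index bits and register count for a target relative SD.

    Assumption: standard error is about 1.04 / sqrt(m), the textbook
    HyperLogLog bound. This is an illustration, not Spark's internal formula.
    """
    if relative_sd <= 0:
        raise ValueError("relative_sd must be positive")
    # Solve 1.04 / sqrt(m) <= relative_sd for m, then round up to a power of two.
    p = math.ceil(math.log2((1.04 / relative_sd) ** 2))
    return p, 1 << p


print(hll_size_for(0.05))  # (9, 512)
print(hll_size_for(0.01))  # (14, 16384)
```

Under this assumption, tightening the accuracy from 0.05 to 0.01 grows the counter from 512 to 16384 registers, which is why smaller values of relativeSD cost more space.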

Notes

The algorithm used is based on streamlib’s implementation of “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm”.
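For intuition about how this family of estimators works, the following is a simplified, pure-Python sketch of the classic HyperLogLog idea with a linear-counting correction for small cardinalities. It is not the streamlib or Spark implementation (it omits the bias correction and sparse representation described in the paper), and the function name `hll_estimate`, the SHA-1 hash, and the default `p=12` are all illustrative assumptions.

```python
import hashlib
import math


def hll_estimate(items, p=12):
    """Estimate the number of distinct items using 2**p registers (p >= 7)."""
    m = 1 << p
    registers = [0] * m
    for item in items:
        # Derive a 64-bit hash from the item's string form (illustrative choice).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - p)                   # top p bits select a register
        rest = h & ((1 << (64 - p)) - 1)      # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1 bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)          # constant valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting on empty registers.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


n = hll_estimate(str(i) for i in range(1000))
print(900 < n < 1100)
n = hll_estimate(i % 20 for i in range(1000))
print(16 < n < 24)
```

Each register only stores the largest rank seen, so memory depends on the register count rather than on the number of elements processed.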

Examples

>>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
>>> 900 < n < 1100
True
>>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
>>> 16 < n < 24
True
