pyspark.RDD.countApproxDistinct

RDD.countApproxDistinct(relativeSD: float = 0.05) → int
Return the approximate number of distinct elements in the RDD.
Parameters
relativeSD : float, optional
    Relative accuracy of the estimate. Smaller values create counters that require more space. Must be greater than 0.000017. Defaults to 0.05.
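To illustrate why smaller values of relativeSD require more space, the sketch below uses the standard HyperLogLog error relation, in which the relative standard error is roughly 1.04 / sqrt(m) for m registers. The helper name hll_precision is hypothetical, and the formula is the textbook relation, not necessarily the exact sizing rule Spark applies internally.

```python
import math


def hll_precision(relative_sd: float) -> int:
    """Hypothetical helper: smallest p with 1.04 / sqrt(2**p) <= relative_sd.

    Assumes the textbook HyperLogLog error relation; Spark's own sizing
    rule may differ in its constants.
    """
    if relative_sd <= 0:
        raise ValueError("relative_sd must be positive")
    # Solve 1.04 / sqrt(2**p) <= relative_sd for p and round up.
    return math.ceil(2 * math.log2(1.04 / relative_sd))


for sd in (0.1, 0.05, 0.01):
    p = hll_precision(sd)
    print(f"relativeSD={sd}: p={p}, registers={2 ** p}")
```

Under this relation the default of 0.05 corresponds to 512 registers, while 0.01 needs 16384. Halving the target error roughly quadruples the number of registers.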
Notes
The algorithm used is based on streamlib’s implementation of “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm”.
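For readers unfamiliar with HyperLogLog, the following is a minimal pure-Python sketch of the basic idea: hash each element, use the first p bits to pick a register, and record the longest run of leading zeros seen in the remaining bits. It is a simplified illustration only. It is not streamlib's or Spark's implementation, it omits the bias corrections described in the paper, and the function name hll_estimate and the choice of SHA-1 as the hash are assumptions made for this example.

```python
import hashlib
import math


def hll_estimate(items, p: int = 10) -> float:
    """Simplified HyperLogLog estimate (hypothetical helper, p >= 7 assumed)."""
    m = 1 << p                      # number of registers
    registers = [0] * m
    for item in items:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        idx = h >> (64 - p)                          # first p bits: register index
        rest = h & ((1 << (64 - p)) - 1)             # remaining 64 - p bits
        # 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)

    alpha = 0.7213 / (1 + 1.079 / m)                 # constant valid for m >= 128
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)

    # Small-range correction: fall back to linear counting when many
    # registers are still empty.
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)
    return raw


print(hll_estimate(str(i) for i in range(1000)))
print(hll_estimate(i % 20 for i in range(1000)))
```

Duplicates do not change the result because each element always hashes to the same register and rank, which is what makes the per-partition counters cheap to merge: combining two sketches is an element-wise maximum over their registers.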
Examples
>>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
>>> 900 < n < 1100
True
>>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
>>> 16 < n < 24
True