pyspark.RDD.zipWithUniqueId

RDD.zipWithUniqueId() → pyspark.rdd.RDD[Tuple[T, int]]

Zips this RDD with generated unique Long ids.

Items in the kth partition will get ids k, n+k, 2*n+k, …, where n is the number of partitions. So there may exist gaps, but this method won’t trigger a spark job, which is different from zipWithIndex().

Examples

>>> sc.parallelize(["a", "b", "c", "d", "e"], 3).zipWithUniqueId().collect()
[('a', 0), ('b', 1), ('c', 4), ('d', 2), ('e', 5)]