pyspark.RDD.intersection

RDD.intersection(other: pyspark.rdd.RDD[T]) → pyspark.rdd.RDD[T]

Return the intersection of this RDD and another one. The output will not contain any duplicate elements, even if the input RDDs did.

Notes

This method performs a shuffle internally.

Examples

>>> rdd1 = sc.parallelize([1, 10, 2, 3, 4, 5])
>>> rdd2 = sc.parallelize([1, 6, 2, 3, 7, 8])
>>> rdd1.intersection(rdd2).collect()
[1, 2, 3]
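
The deduplication and shuffle behavior described above can be modeled without a cluster. The following is a minimal plain-Python sketch of the semantics only. It is not PySpark's implementation, and `intersection_model` is a hypothetical helper name introduced for illustration. It assumes the elements are hashable, and it shows why duplicates in either input do not survive: each distinct value is grouped once, then kept only if both sides contributed to it.

```python
from collections import defaultdict


def intersection_model(left, right):
    """Plain-Python model of a distinct intersection.

    Hypothetical helper for illustration; not part of the PySpark API.
    """
    # Group by value, recording which input each value appeared in.
    # On a cluster, bringing equal values from different partitions
    # together is the step that requires a shuffle.
    seen = defaultdict(lambda: [False, False])
    for value in left:
        seen[value][0] = True
    for value in right:
        seen[value][1] = True

    # Emit each value once, and only if it was present in both inputs.
    return [v for v, (in_left, in_right) in seen.items() if in_left and in_right]


# Both inputs contain duplicates (2 on the left, 3 on the right).
result = intersection_model([1, 10, 2, 2, 3, 4, 5], [1, 6, 2, 3, 3, 7, 8])
print(sorted(result))  # [1, 2, 3]
```

The result is sorted before printing because this model, like a shuffled RDD, is best treated as an unordered collection of distinct values.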
