pyspark.RDD.aggregate
RDD.aggregate(zeroValue: U, seqOp: Callable[[U, T], U], combOp: Callable[[U, U], U]) → U

Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value."
The functions op(t1, t2) are allowed to modify t1 and return it as the result value to avoid object allocation; however, they should not modify t2.

The first function (seqOp) can return a different result type, U, than the element type of this RDD, T. Thus, we need one operation for merging a T into a U (seqOp) and one operation for merging two U's (combOp).
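As a conceptual illustration, the same contract can be sketched in plain Python: seqOp folds each partition's elements into an accumulator of type U seeded with zeroValue, and combOp then merges the per-partition accumulators. This is only a local sketch of the semantics, not Spark's distributed implementation; aggregate_locally and its partition lists are hypothetical names used for illustration.

from functools import reduce

def aggregate_locally(partitions, zero_value, seq_op, comb_op):
    # Fold each partition independently, seeding each fold with
    # zero_value (type U), then merge the per-partition results.
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    return reduce(comb_op, partials, zero_value)

# Two simulated partitions holding the elements [1, 2, 3, 4] (T = int):
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)    # merge a T into a U
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])   # merge two U's
aggregate_locally([[1, 2], [3, 4]], (0, 0), seq_op, comb_op)  # (10, 4)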
Examples
>>> seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
>>> combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
>>> sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)
(10, 4)
>>> sc.parallelize([]).aggregate((0, 0), seqOp, combOp)
(0, 0)
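Because the zero value (0, 0) carries both a running sum and a count, the aggregate can be post-processed into an average. A small continuation of the session above (assuming the same sc, seqOp, and combOp):

>>> total, count = sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)
>>> total / count
2.5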