pyspark.streaming.DStream.reduceByWindow

DStream.reduceByWindow(reduceFunc: Callable[[T, T], T], invReduceFunc: Optional[Callable[[T, T], T]], windowDuration: int, slideDuration: int) → pyspark.streaming.dstream.DStream[T]

Return a new DStream in which each RDD has a single element generated by reducing all elements in a sliding window over this DStream.

If invReduceFunc is not None, the reduction is done incrementally using the old window’s reduced value:

1. reduce the new values that entered the window (e.g., adding new counts)

2. “inverse reduce” the old values that left the window (e.g., subtracting old counts)

This is more efficient than the case where invReduceFunc is None, in which every element in the window is reduced again for each new window.
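The two steps above can be illustrated without Spark. The following is a minimal pure-Python sketch, not PySpark's implementation: batches are modeled as plain lists, and the helper names `window_full` and `window_incremental` are hypothetical. It compares reducing a whole window from scratch with updating the previous window's value.

```python
from functools import reduce
from operator import add, sub


def window_full(batches, end, width, reduce_func):
    """Reduce every element of the `width` batches ending at index `end`."""
    start = max(0, end - width + 1)
    values = [v for batch in batches[start:end + 1] for v in batch]
    return reduce(reduce_func, values)


def window_incremental(prev, entered, left, reduce_func, inv_reduce_func):
    """Update the previous window's reduced value using only the batches that moved."""
    # Step 1: reduce the new values that entered the window.
    for batch in entered:
        for v in batch:
            prev = reduce_func(prev, v)
    # Step 2: "inverse reduce" the old values that left the window.
    for batch in left:
        for v in batch:
            prev = inv_reduce_func(prev, v)
    return prev


# Hypothetical data: five non-empty batches, window of 3 batches, slide of 1 batch.
batches = [[1, 2], [3], [4, 5], [6], [7, 8]]
width = 3

# Recompute each window from scratch (the invReduceFunc=None behaviour).
full = [window_full(batches, i, width, add) for i in range(width - 1, len(batches))]

# Start from the first full window, then update incrementally.
incremental = [full[0]]
for i in range(width, len(batches)):
    incremental.append(
        window_incremental(incremental[-1], [batches[i]], [batches[i - width]], add, sub)
    )

print(full)         # [15, 18, 30]
print(incremental)  # [15, 18, 30]
```

Both approaches give the same values, but the incremental one touches only the batches that entered or left the window, so the saving grows with the ratio of window width to slide interval.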

Parameters

reduceFunc : function
    associative and commutative reduce function
invReduceFunc : function, optional
    inverse reduce function of reduceFunc, such that for all y and invertible x: invReduceFunc(reduceFunc(x, y), x) = y. May be None, in which case the window is not reduced incrementally.
windowDuration : int
    width of the window, in seconds; must be a multiple of this DStream’s batching interval
slideDuration : int
    sliding interval of the window, in seconds (i.e., the interval after which the new DStream will generate RDDs); must be a multiple of this DStream’s batching interval
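As a usage sketch, a windowed sum could be written as below. The helper `windowed_sum` and its default durations are hypothetical; `stream` is assumed to be an existing DStream of numbers whose batch interval divides both durations. Addition and subtraction are used because they satisfy the inverse property stated for invReduceFunc.

```python
from operator import add, sub


def windowed_sum(stream, window_seconds=30, slide_seconds=10):
    """Return a DStream with one element per window: the sum of the window's values.

    Assumption: `stream` is a DStream of numbers and both durations are
    multiples of its batch interval. Spark Streaming generally expects
    checkpointing to be enabled on the StreamingContext when an inverse
    reduce function is supplied.
    """
    return stream.reduceByWindow(add, sub, window_seconds, slide_seconds)


# Sanity check of the documented property: invReduceFunc(reduceFunc(x, y), x) == y
x, y = 3, 4
inverse_holds = sub(add(x, y), x) == y
```

Passing None instead of `sub` would be expected to give the same sums, with each window reduced in full rather than updated from the previous one.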
