pyspark.streaming.DStream.reduceByKeyAndWindow

DStream.reduceByKeyAndWindow(func: Callable[[V, V], V], invFunc: Optional[Callable[[V, V], V]], windowDuration: int, slideDuration: Optional[int] = None, numPartitions: Optional[int] = None, filterFunc: Optional[Callable[[Tuple[K, V]], bool]] = None) → pyspark.streaming.dstream.DStream[Tuple[K, V]]

Return a new DStream by applying incremental reduceByKey over a sliding window.

The reduced value over a new window is calculated incrementally, using the old window's reduced value:
  1. reduce the new values that entered the window (e.g., adding new counts)

  2. “inverse reduce” the old values that left the window (e.g., subtracting old counts)

invFunc can be None; in that case the window is recomputed by reducing all the RDDs it contains, which can be slower than supplying an invFunc.
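The two-step incremental update above can be sketched in plain Python. This is a simulation of the mechanics over lists of batches, not Spark code; the helper names are illustrative:

```python
def reduce_by_key(batch, func):
    """Reduce a list of (key, value) pairs by key with func."""
    out = {}
    for k, v in batch:
        out[k] = func(out[k], v) if k in out else v
    return out

def windowed_reduce(batches, window, func, inv_func):
    """Yield the per-key reduced value for each sliding window of
    `window` consecutive batches, updated incrementally."""
    state = {}
    for i, batch in enumerate(batches):
        # 1. reduce the new values that entered the window
        for k, v in reduce_by_key(batch, func).items():
            state[k] = func(state[k], v) if k in state else v
        # 2. "inverse reduce" the old values that left the window
        if i >= window:
            for k, v in reduce_by_key(batches[i - window], func).items():
                state[k] = inv_func(state[k], v)
        yield dict(state)

batches = [[("a", 1)], [("a", 2), ("b", 3)], [("a", 4)]]
results = list(windowed_reduce(batches, window=2,
                               func=lambda x, y: x + y,
                               inv_func=lambda x, y: x - y))
# In the last window, ("a", 1) from the first batch has been
# inverse-reduced out: results[-1] == {"a": 6, "b": 3}
```

Note that keys whose value drops to zero linger in the state; in the real API, filterFunc is the hook for discarding such expired pairs.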

Parameters
func : function

associative and commutative reduce function

invFunc : function

inverse function of reduceFunc

windowDuration : int

width of the window; must be a multiple of this DStream’s batching interval

slideDuration : int, optional

sliding interval of the window (i.e., the interval after which the new DStream will generate RDDs); must be a multiple of this DStream’s batching interval

numPartitions : int, optional

number of partitions of each RDD in the new DStream.

filterFunc : function, optional

function to filter expired key-value pairs; only pairs that satisfy the function are retained. Set this to None if you do not want to filter.
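A typical call might look like the sketch below, which counts words over a 30-second window sliding every 10 seconds, using subtraction as the inverse function and filterFunc to drop keys whose count falls to zero. The socket source, host/port, and checkpoint path are illustrative assumptions; the streaming wiring is kept inside a function so the pure reduce helpers can be used on their own:

```python
def add(a, b):
    # associative and commutative reduce function
    return a + b

def subtract(a, b):
    # inverse of add: removes values that left the window
    return a - b

def keep_nonzero(pair):
    # filterFunc: drop keys whose windowed count has fallen to zero
    return pair[1] > 0

def build_job():
    # Streaming wiring (illustrative source and durations).
    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="WindowedWordCount")
    ssc = StreamingContext(sc, batchDuration=10)
    ssc.checkpoint("checkpoint")  # windowed state requires checkpointing

    lines = ssc.socketTextStream("localhost", 9999)
    pairs = lines.flatMap(lambda line: line.split()).map(lambda w: (w, 1))
    counts = pairs.reduceByKeyAndWindow(
        add, subtract, windowDuration=30, slideDuration=10,
        filterFunc=keep_nonzero)
    counts.pprint()
    return ssc
```

Because subtract can leave zero-count entries behind as words age out of the window, keep_nonzero prunes that state rather than letting it grow unboundedly.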