pyspark.pandas.Series.pandas_on_spark.transform_batch

pandas_on_spark.transform_batch(func: Callable[[…], pandas.core.series.Series], *args: Any, **kwargs: Any) → Series
- Transform the data with a function that takes a pandas Series and returns a pandas Series. The pandas Series passed to the function is one of the batches used internally.

See also: Transform and apply a function.

Note: func cannot access the whole input series. pandas-on-Spark internally splits the input series into multiple batches and calls func once for each batch. Therefore, operations that need the whole series, such as global aggregations, are impossible. See the example below.

>>> # This case does not return the length of whole frame but of the batch
... # internally used.
... def length(pser) -> ps.Series[int]:
...     return pd.Series([len(pser)] * len(pser))
...
>>> df = ps.DataFrame({'A': range(1000)})
>>> df.A.pandas_on_spark.transform_batch(length)
    c0
0   83
1   83
2   83
...

Note: this API executes the function once to infer the return type, which is potentially expensive, for instance when the dataset is created after aggregations or sorting. To avoid this, specify the return type in func, for instance, as below:

>>> def plus_one(x) -> ps.Series[int]:
...     return x + 1

- Parameters
- func : function
- Function to apply to each batch, which is passed in as a pandas Series.
- *args
- Positional arguments to pass to func. 
- **kwargs
- Keyword arguments to pass to func. 
 
- Returns
- Series
 
- See also
- DataFrame.pandas_on_spark.apply_batch
- Similar, but it takes a pandas DataFrame as its internal batch.
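The batching behavior described in the note above can be imitated with plain pandas, which may make it easier to see why a per-batch function cannot compute a global result. The sketch below is only an emulation: `emulate_transform_batch` is a hypothetical helper, not part of pandas-on-Spark, and the choice of 12 batches is an arbitrary assumption, so the batch sizes will not necessarily match what Spark produces.

```python
import numpy as np
import pandas as pd


def emulate_transform_batch(pser, func, *args, n_batches=4, **kwargs):
    """Hypothetical helper: call func once per batch and stitch the results together."""
    # Split the row positions into n_batches contiguous, near-equal chunks.
    chunks = np.array_split(np.arange(len(pser)), n_batches)
    return pd.concat([func(pser.iloc[idx], *args, **kwargs) for idx in chunks])


def length(pser):
    # Sees only one batch, so len(pser) is the batch size, not the total size.
    return pd.Series([len(pser)] * len(pser), index=pser.index)


def plus_one(pser):
    # Element-wise, so the result does not depend on how the rows are batched.
    return pser + 1


pser = pd.Series(range(1000), name="A")
batched_lengths = emulate_transform_batch(pser, length, n_batches=12)
batched_plus_one = emulate_transform_batch(pser, plus_one, n_batches=12)

# Every row reports the size of its own batch, never 1000.
print(sorted(set(batched_lengths.tolist())))
# The element-wise function gives the same values as applying it to the whole series.
print(batched_plus_one.tolist() == (pser + 1).tolist())
```

Functions that work row by row, like `plus_one`, are safe to use with batch APIs, while functions whose result depends on all rows, like `length`, silently return per-batch values.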
- Examples

>>> df = ps.DataFrame([(1, 2), (3, 4), (5, 6)], columns=['A', 'B'])
>>> df
   A  B
0  1  2
1  3  4
2  5  6

>>> def plus_one_func(pser) -> ps.Series[np.int64]:
...     return pser + 1
>>> df.A.pandas_on_spark.transform_batch(plus_one_func)
0    2
1    4
2    6
Name: A, dtype: int64

You can also omit the type hints so pandas-on-Spark infers the return schema as below:

>>> df.A.pandas_on_spark.transform_batch(lambda pser: pser + 1)
0    2
1    4
2    6
Name: A, dtype: int64

You can also specify extra arguments.

>>> def plus_one_func(pser, a, b, c=3) -> ps.Series[np.int64]:
...     return pser + a + b + c
>>> df.A.pandas_on_spark.transform_batch(plus_one_func, 1, b=2)
0     7
1     9
2    11
Name: A, dtype: int64

You can also use np.ufunc and built-in functions as input.

>>> df.A.pandas_on_spark.transform_batch(np.add, 10)
0    11
1    13
2    15
Name: A, dtype: int64

>>> (df * -1).A.pandas_on_spark.transform_batch(abs)
0    1
1    3
2    5
Name: A, dtype: int64
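Because extra arguments are forwarded to func, one possible way to work around the per-batch restriction is to compute a global value first and pass it in. The following is a minimal pandas-only sketch of that idea, not the library's implementation: `emulate_transform_batch`, `center`, and `center_per_batch` are hypothetical names, and the two-batch split is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


def emulate_transform_batch(pser, func, *args, n_batches=4, **kwargs):
    """Hypothetical helper: apply func to each batch, forwarding extra arguments."""
    chunks = np.array_split(np.arange(len(pser)), n_batches)
    return pd.concat([func(pser.iloc[idx], *args, **kwargs) for idx in chunks])


def center(pser, mean):
    # `mean` is supplied by the caller, so every batch subtracts the same value.
    return pser - mean


def center_per_batch(pser):
    # Not a global centering: this uses the mean of the current batch only.
    return pser - pser.mean()


pser = pd.Series([1.0, 3.0, 5.0, 7.0], name="A")
global_mean = pser.mean()  # computed once over the whole series: 4.0

centered = emulate_transform_batch(pser, center, n_batches=2, mean=global_mean)
per_batch = emulate_transform_batch(pser, center_per_batch, n_batches=2)

print(centered.tolist())   # [-3.0, -1.0, 1.0, 3.0]
print(per_batch.tolist())  # [-1.0, 1.0, -1.0, 1.0]
```

With pandas-on-Spark, the analogous call would presumably compute the aggregate on the pandas-on-Spark Series first and then pass it as a keyword argument, along the lines of `psser.pandas_on_spark.transform_batch(center, mean=psser.mean())`, where `psser` is an assumed pandas-on-Spark Series.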