pyspark.pandas.Series.apply

Series.apply(func: Callable, args: Sequence[Any] = (), **kwds: Any) → pyspark.pandas.series.Series

Invoke function on values of Series.

The applied function can be any Python function that operates on the values of the Series.

Note

This API executes the function once to infer the return type, which is potentially expensive, for instance when the dataset is created after aggregations or sorting.

To avoid this, specify the return type as a type hint on func, for instance as below:

>>> def square(x) -> np.int32:
...     return x ** 2

pandas-on-Spark uses the return type hint and does not try to infer the type.
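The reason the hint avoids the extra execution can be sketched in plain Python: a function's return annotation is readable without ever calling the function. This sketch uses only the standard library and a builtin type, and is not the pandas-on-Spark internals:

```python
import typing

def square(x) -> int:  # builtin int stands in for np.int64 in this sketch
    return x ** 2

# The annotation is available without executing square, so no sample
# evaluation is needed to determine the result type.
return_type = typing.get_type_hints(square)["return"]
print(return_type)  # <class 'int'>
```

When no annotation is present, the only way to learn the result type is to actually run the function on data, which is the one-off execution the note warns about.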

Parameters
func : function

Python function to apply. Specifying a return type hint is recommended to avoid type inference.

args : tuple

Positional arguments passed to func after the series value.

**kwds

Additional keyword arguments passed to func.
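The calling convention for args and **kwds can be illustrated with a plain-Python sketch (apply_one is a hypothetical helper for illustration, not the actual implementation): each Series value is passed as the first positional argument, followed by args, then the keywords.

```python
def apply_one(value, func, args=(), **kwds):
    # The element is always the first positional argument;
    # args and kwds are forwarded unchanged to func.
    return func(value, *args, **kwds)

def subtract_custom_value(x, custom_value):
    return x - custom_value

print(apply_one(20, subtract_custom_value, args=(5,)))  # 15
```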

Returns
Series

See also

Series.aggregate

Only perform aggregating type operations.

Series.transform

Only perform transforming type operations.

DataFrame.apply

The equivalent function for DataFrame.

Examples

Create a Series with typical summer temperatures for each city.

>>> import numpy as np
>>> import pyspark.pandas as ps
>>> s = ps.Series([20, 21, 12],
...               index=['London', 'New York', 'Helsinki'])
>>> s
London      20
New York    21
Helsinki    12
dtype: int64

Square the values by defining a function and passing it as an argument to apply().

>>> def square(x) -> np.int64:
...     return x ** 2
>>> s.apply(square)
London      400
New York    441
Helsinki    144
dtype: int64

Define a custom function that needs additional positional arguments and pass them using the args keyword.

>>> def subtract_custom_value(x, custom_value) -> np.int64:
...     return x - custom_value
>>> s.apply(subtract_custom_value, args=(5,))
London      15
New York    16
Helsinki     7
dtype: int64

Define a custom function that takes keyword arguments and pass them to apply.

>>> def add_custom_values(x, **kwargs) -> np.int64:
...     for month in kwargs:
...         x += kwargs[month]
...     return x
>>> s.apply(add_custom_values, june=30, july=20, august=25)
London      95
New York    96
Helsinki    87
dtype: int64

Use a function from the NumPy library.

>>> def numpy_log(col) -> np.float64:
...     return np.log(col)
>>> s.apply(numpy_log)
London      2.995732
New York    3.044522
Helsinki    2.484907
dtype: float64

You can omit the type hint and let pandas-on-Spark infer the return type, at the cost of the extra execution described in the note above.

>>> s.apply(np.log)
London      2.995732
New York    3.044522
Helsinki    2.484907
dtype: float64