pyspark.pandas.DataFrame.groupby

DataFrame.groupby(by: Union[Any, Tuple[Any, …], Series, List[Union[Any, Tuple[Any, …], Series]]], axis: Union[int, str] = 0, as_index: bool = True, dropna: bool = True) → DataFrameGroupBy

Group DataFrame or Series using a Series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters
bySeries, label, or list of labels

Used to determine the groups for the groupby. If Series is passed, the Series or dict VALUES will be used to determine the groups. A label or list of labels may be passed to group by the columns in self.

axisint, default 0 or ‘index’

Can only be set to 0 at the moment.

as_indexbool, default True

For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.

dropnabool, default True

If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

Returns
DataFrameGroupBy or SeriesGroupBy

Depends on the calling object and returns groupby object that contains information about the groups.

See also

pyspark.pandas.groupby.GroupBy

Examples

>>> df = ps.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]},
...                   columns=['Animal', 'Max Speed'])
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean().sort_index()  
        Max Speed
Animal
Falcon      375.0
Parrot       25.0
>>> df.groupby(['Animal'], as_index=False).mean().sort_values('Animal')
... 
   Animal  Max Speed
...Falcon      375.0
...Parrot       25.0

We can also choose to include NA in group keys or not by setting dropna parameter, the default setting is True:

>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = ps.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum().sort_index()  
     a  c
b
1.0  2  3
2.0  2  5
>>> df.groupby(by=["b"], dropna=False).sum().sort_index()  
     a  c
b
1.0  2  3
2.0  2  5
NaN  1  4