pyspark.pandas.groupby.GroupBy.median

GroupBy.median(numeric_only: Optional[bool] = True, accuracy: int = 10000) → FrameLike

Compute median of groups, excluding missing values.

For multiple groupings, the result index will be a MultiIndex.

Note

Unlike pandas, the median in pandas-on-Spark is an approximation based on approximate percentile computation, because computing an exact median across a large dataset is extremely expensive.

Parameters
numeric_only : bool, default True

Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data.

accuracy : int, default 10000

Accuracy of the approximation; a larger value yields better accuracy at the cost of memory.

Returns
Series or DataFrame

Median of values within each group.

Examples

>>> import pyspark.pandas as ps
>>> psdf = ps.DataFrame({'a': [1., 1., 1., 1., 2., 2., 2., 3., 3., 3.],
...                     'b': [2., 3., 1., 4., 6., 9., 8., 10., 7., 5.],
...                     'c': [3., 5., 2., 5., 1., 2., 6., 4., 3., 6.]},
...                    columns=['a', 'b', 'c'],
...                    index=[7, 2, 4, 1, 3, 4, 9, 10, 5, 6])
>>> psdf
      a     b    c
7   1.0   2.0  3.0
2   1.0   3.0  5.0
4   1.0   1.0  2.0
1   1.0   4.0  5.0
3   2.0   6.0  1.0
4   2.0   9.0  2.0
9   2.0   8.0  6.0
10  3.0  10.0  4.0
5   3.0   7.0  3.0
6   3.0   5.0  6.0

DataFrameGroupBy

>>> psdf.groupby('a').median().sort_index()
       b    c
a
1.0  2.0  3.0
2.0  8.0  2.0
3.0  7.0  4.0
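
As mentioned in the note above, the result is based on approximate percentile computation. As a rough sketch (not necessarily the exact internal query), a comparable Spark SQL aggregation using percentile_approx on the underlying Spark DataFrame yields the same group medians for column b:

>>> from pyspark.sql import functions as F
>>> sdf = psdf.to_spark()
>>> sdf.groupBy('a').agg(F.percentile_approx('b', 0.5, 10000).alias('b')).sort('a').collect()
[Row(a=1.0, b=2.0), Row(a=2.0, b=8.0), Row(a=3.0, b=7.0)]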

SeriesGroupBy

>>> psdf.groupby('a')['b'].median().sort_index()
a
1.0    2.0
2.0    8.0
3.0    7.0
Name: b, dtype: float64
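
The precision of the approximation can be tuned with the accuracy parameter; a larger value gives a more accurate result at the cost of memory. For a small dataset like this the result is unchanged (illustrative sketch):

>>> psdf.groupby('a')['b'].median(accuracy=100000).sort_index()
a
1.0    2.0
2.0    8.0
3.0    7.0
Name: b, dtype: float64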