pyspark.pandas.DataFrame.median
DataFrame.median(axis: Union[int, str, None] = None, skipna: bool = True, numeric_only: bool = None, accuracy: int = 10000) → Union[int, float, bool, str, bytes, decimal.Decimal, datetime.date, datetime.datetime, None, Series]
Return the median of the values for the requested axis.
Note
Unlike pandas, pandas-on-Spark returns an approximate median, computed with an approximate percentile algorithm, because computing an exact median across a large dataset is extremely expensive.
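The difference between an exact median and a percentile-style median can be illustrated without Spark. The sketch below is plain Python, not the pandas-on-Spark implementation: `element_median` is a hypothetical helper that assumes the approximate percentile picks a value actually present in the data rather than averaging the two middle values, so the two results may differ on even-sized inputs.

```python
import math
import statistics


def element_median(values):
    """Return the smallest element whose rank covers at least half the data.

    Hypothetical helper for illustration only: a pure-Python stand-in for a
    percentile-style median that always returns a value present in the input.
    """
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # 1-based rank of the first element that covers at least 50% of the rows.
    rank = math.ceil(0.5 * len(ordered))
    return ordered[rank - 1]


even_sample = [1.0, 2.0, 3.0, 4.0]
odd_sample = [24.0, 21.0, 25.0, 33.0, 26.0]

exact_even = statistics.median(even_sample)  # averages the two middle values
picked_even = element_median(even_sample)    # picks an existing value instead
exact_odd = statistics.median(odd_sample)
picked_odd = element_median(odd_sample)      # same as the exact median here
```

With an odd number of values both approaches agree; with an even number, the percentile-style result is one of the two middle values rather than their mean.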
- Parameters
- axis : {index (0), columns (1)}
Axis for the function to be applied on.
- skipna : bool, default True
Exclude NA/null values when computing the result.
Setting this to False, which includes NA/null values, is supported.
- numeric_only : bool, default None
Include only float, int, boolean columns. False is not supported. This parameter is mainly for pandas compatibility.
- accuracy : int, optional
Accuracy of the approximation (default 10000). A larger value gives better accuracy. The relative error is 1.0 / accuracy.
- Returns
- median : scalar or Series
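As a rough guide to choosing `accuracy`, the relative error of 1.0 / accuracy can be computed directly. The sketch below is plain Python with hypothetical helper names; reading the relative error as a slack in row rank (the number of rows by which the returned value may be off from the true middle position) is an assumption made for illustration, not something this page states.

```python
def relative_error(accuracy: int) -> float:
    """Relative error of the approximation: 1.0 / accuracy."""
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return 1.0 / accuracy


def rank_slack(n_rows: int, accuracy: int) -> float:
    """Assumed interpretation: rows of rank slack for a dataset of n_rows."""
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    return n_rows / accuracy


# Default accuracy from the signature.
default_error = relative_error(10000)

# A smaller accuracy is cheaper but looser; a larger one is tighter.
coarse_error = relative_error(100)
slack_for_million_rows = rank_slack(1_000_000, 10000)
```

Under this reading, raising `accuracy` trades extra computation for a tighter bound, which matters most when the values near the middle of the distribution are far apart.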
Examples
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({
...     'a': [24., 21., 25., 33., 26.], 'b': [1, 2, 3, 4, 5]}, columns=['a', 'b'])
>>> df
      a  b
0  24.0  1
1  21.0  2
2  25.0  3
3  33.0  4
4  26.0  5
On a DataFrame:
>>> df.median()
a    25.0
b     3.0
dtype: float64
On a Series:
>>> df['a'].median()
25.0
>>> (df['b'] + 100).median()
103.0
For multi-index columns:
>>> df.columns = pd.MultiIndex.from_tuples([('x', 'a'), ('y', 'b')])
>>> df
      x  y
      a  b
0  24.0  1
1  21.0  2
2  25.0  3
3  33.0  4
4  26.0  5
On a DataFrame:
>>> df.median()
x  a    25.0
y  b     3.0
dtype: float64
>>> df.median(axis=1)
0    12.5
1    11.5
2    14.0
3    18.5
4    15.5
dtype: float64
On a Series:
>>> df[('x', 'a')].median()
25.0
>>> (df[('y', 'b')] + 100).median()
103.0