pyspark.pandas.DataFrame.pivot
DataFrame.pivot(index: Union[Any, Tuple[Any, ...], None] = None, columns: Union[Any, Tuple[Any, ...], None] = None, values: Union[Any, Tuple[Any, ...], None] = None) → pyspark.pandas.frame.DataFrame
Return reshaped DataFrame organized by given index / column values.
Reshape data (produce a “pivot” table) based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation.
- Parameters
- index : string, optional
Column to use to make new frame’s index. If None, uses existing index.
- columns : string
Column to use to make new frame’s columns.
- values : string, object or a list of the previous
Column(s) to use for populating new frame’s values.
- Returns
- DataFrame
Returns reshaped DataFrame.
See also
DataFrame.pivot_table
Generalization of pivot that can handle duplicate values for one index/column pair.
Examples
>>> df = ps.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
...                            'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']},
...                   columns=['foo', 'bar', 'baz', 'zoo'])
>>> df
   foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t

>>> df.pivot(index='foo', columns='bar', values='baz').sort_index()
bar  A  B  C
foo
one  1  2  3
two  4  5  6

>>> df.pivot(columns='bar', values='baz').sort_index()
bar    A    B    C
0    1.0  NaN  NaN
1    NaN  2.0  NaN
2    NaN  NaN  3.0
3    4.0  NaN  NaN
4    NaN  5.0  NaN
5    NaN  NaN  6.0
Note that, unlike pandas, which raises a ValueError when duplicated values are found, pandas-on-Spark’s pivot still works and keeps the first value it meets during the operation. Pivot is an expensive operation, so executing permissively is preferred over failing fast when processing large data.
>>> df = ps.DataFrame({"foo": ['one', 'one', 'two', 'two'],
...                    "bar": ['A', 'A', 'B', 'C'],
...                    "baz": [1, 2, 3, 4]}, columns=['foo', 'bar', 'baz'])
>>> df
   foo bar  baz
0  one   A    1
1  one   A    2
2  two   B    3
3  two   C    4

>>> df.pivot(index='foo', columns='bar', values='baz').sort_index()
bar    A    B    C
foo
one  1.0  NaN  NaN
two  NaN  3.0  4.0
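For comparison, the sketch below runs the same duplicated data through plain pandas (not pandas-on-Spark) to show the fail-fast behavior described above. It also shows one way to make the handling of duplicates explicit with `pivot_table`. Using `aggfunc="first"` to approximate the "first value wins" behavior is an assumption made for illustration, not a statement about how pandas-on-Spark implements `pivot` internally.

```python
import pandas as pd

# Same data as the pandas-on-Spark example: ('one', 'A') appears twice.
pdf = pd.DataFrame({
    "foo": ["one", "one", "two", "two"],
    "bar": ["A", "A", "B", "C"],
    "baz": [1, 2, 3, 4],
})

# Plain pandas refuses to pivot when an index/column pair is duplicated.
try:
    pdf.pivot(index="foo", columns="bar", values="baz")
    raised = False
except ValueError as exc:
    raised = True
    print(f"pandas raised ValueError: {exc}")

# pivot_table takes an aggregation function, so duplicates are resolved
# by an explicit choice rather than by an error.
first_wins = pdf.pivot_table(index="foo", columns="bar", values="baz", aggfunc="first")
summed = pdf.pivot_table(index="foo", columns="bar", values="baz", aggfunc="sum")

print(first_wins)
print(summed)
```

When the choice between duplicated values matters, an explicit aggregation through `pivot_table` documents the intent better than relying on whichever value `pivot` happens to meet first.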
It also supports multi-index and multi-index columns.

>>> df.columns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'), ('b', 'baz')])

>>> df = df.set_index(('a', 'bar'), append=True)
>>> df
              a   b
            foo baz
  (a, bar)
0 A         one   1
1 A         one   2
2 B         two   3
3 C         two   4

>>> df.pivot(columns=('a', 'foo'), values=('b', 'baz')).sort_index()
('a', 'foo')  one  two
  (a, bar)
0 A           1.0  NaN
1 A           2.0  NaN
2 B           NaN  3.0
3 C           NaN  4.0