pyspark.pandas.DataFrame.pivot

DataFrame.pivot(index: Union[Any, Tuple[Any, ...], None] = None, columns: Union[Any, Tuple[Any, ...], None] = None, values: Union[Any, Tuple[Any, ...], None] = None) → pyspark.pandas.frame.DataFrame

Return reshaped DataFrame organized by given index / column values.

Reshape data (produce a “pivot” table) based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation.
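
As a rough mental model, the reshape can be pictured as collecting the unique index and column values as axes and then placing each row's value in the matching cell. The sketch below is plain Python, not how pandas-on-Spark implements the method. The helper name `pivot_records` and the use of `None` for empty cells are assumptions made for illustration; the real method returns a DataFrame and shows empty cells as NaN.

```python
def pivot_records(records, index, columns, values):
    """Hypothetical helper: reshape a list of dicts from long to wide form."""
    # The unique values of the chosen fields become the two axes.
    row_labels = sorted({r[index] for r in records})
    col_labels = sorted({r[columns] for r in records})
    # Start with an empty grid; None marks a missing index/column pair.
    table = {row: {col: None for col in col_labels} for row in row_labels}
    # Place each value in its cell. No aggregation is performed, and the
    # sample data below has no duplicate index/column pairs.
    for r in records:
        table[r[index]][r[columns]] = r[values]
    return table


records = [
    {"foo": "one", "bar": "A", "baz": 1},
    {"foo": "one", "bar": "B", "baz": 2},
    {"foo": "two", "bar": "A", "baz": 4},
]
wide = pivot_records(records, index="foo", columns="bar", values="baz")
print(wide)
```

The pair ('two', 'B') never occurs in the input, so its cell stays empty, which corresponds to the NaN entries in the examples further down.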

Parameters

index : string, optional
    Column to use to make new frame’s index. If None, uses existing index.
columns : string
    Column to use to make new frame’s columns.
values : string, object or a list of the previous
    Column(s) to use for populating new frame’s values.

Returns

DataFrame: Returns reshaped DataFrame.

See also

DataFrame.pivot_table: Generalization of pivot that can handle duplicate values for one index/column pair.

Examples

>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
...                            'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']},
...                   columns=['foo', 'bar', 'baz', 'zoo'])
>>> df
   foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t

>>> df.pivot(index='foo', columns='bar', values='baz').sort_index()
bar  A  B  C
foo
one  1  2  3
two  4  5  6

>>> df.pivot(columns='bar', values='baz').sort_index()
bar    A    B    C
0    1.0  NaN  NaN
1    NaN  2.0  NaN
2    NaN  NaN  3.0
3    4.0  NaN  NaN
4    NaN  5.0  NaN
5    NaN  NaN  6.0

Note that, unlike pandas, which raises a ValueError when duplicated values are found, pandas-on-Spark’s pivot still works and uses the first value it meets during the operation. Pivot is an expensive operation, so executing permissively is preferred over failing fast when processing large data.

>>> df = ps.DataFrame({"foo": ['one', 'one', 'two', 'two'],
...                    "bar": ['A', 'A', 'B', 'C'],
...                    "baz": [1, 2, 3, 4]}, columns=['foo', 'bar', 'baz'])
>>> df
   foo bar  baz
0  one   A    1
1  one   A    2
2  two   B    3
3  two   C    4

>>> df.pivot(index='foo', columns='bar', values='baz').sort_index()
bar    A    B    C
foo
one  1.0  NaN  NaN
two  NaN  3.0  4.0
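
The difference between the permissive and the fail-fast policy can be sketched in plain Python. This is only an illustration under stated assumptions: `pivot_cells` and its `strict` flag are hypothetical names, and "first" here means first in input order, whereas the row order seen by a distributed Spark job may differ.

```python
def pivot_cells(rows, strict=False):
    """Hypothetical helper: map (index, column) pairs to a single value.

    rows is an iterable of (index, column, value) tuples.
    strict=True mimics a fail-fast policy; strict=False keeps the first value.
    """
    cells = {}
    for idx, col, val in rows:
        key = (idx, col)
        if key in cells:
            if strict:
                raise ValueError(f"duplicate entry for {key}")
            continue  # permissive: keep the value that was seen first
        cells[key] = val
    return cells


rows = [("one", "A", 1), ("one", "A", 2), ("two", "B", 3), ("two", "C", 4)]
permissive = pivot_cells(rows)
print(permissive[("one", "A")])

try:
    pivot_cells(rows, strict=True)
except ValueError as exc:
    print("strict mode failed:", exc)
```

With the same data as the example above, the permissive call keeps 1 for the ('one', 'A') cell and drops 2, while the strict call stops at the first duplicate.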

It also supports multi-index and multi-index columns.

>>> df.columns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'), ('b', 'baz')])

>>> df = df.set_index(('a', 'bar'), append=True)
>>> df
              a   b
            foo baz
  (a, bar)
0 A         one   1
1 A         one   2
2 B         two   3
3 C         two   4

>>> df.pivot(columns=('a', 'foo'), values=('b', 'baz')).sort_index()
('a', 'foo')  one  two
  (a, bar)
0 A           1.0  NaN
1 A           2.0  NaN
2 B           NaN  3.0
3 C           NaN  4.0
