pyspark.pandas.DataFrame.pivot

DataFrame.pivot(index: Union[Any, Tuple[Any, ...], None] = None, columns: Union[Any, Tuple[Any, ...], None] = None, values: Union[Any, Tuple[Any, ...], None] = None) → pyspark.pandas.frame.DataFrame

Return reshaped DataFrame organized by given index / column values.

Reshape data (produce a “pivot” table) based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation.

Parameters
index : string, optional

Column to use to make new frame’s index. If None, uses existing index.

columns : string

Column to use to make new frame’s columns.

values : string, object or a list of the previous

Column(s) to use for populating new frame’s values.

Returns
DataFrame

Returns reshaped DataFrame.

See also

DataFrame.pivot_table

Generalization of pivot that can handle duplicate values for one index/column pair.
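As a rough illustration of that generalization, the sketch below aggregates duplicate index/column pairs with `pivot_table` instead of keeping a single value. It is written against plain pandas so that it runs without a Spark session; it assumes that pandas-on-Spark's `pivot_table` accepts the same `index`, `columns`, `values`, and `aggfunc` arguments for this simple case. The sample data and the choice of `"sum"` are illustrative only.

```python
import pandas as pd  # plain pandas stands in for pyspark.pandas (assumption)

# ('one', 'A') appears twice, so a plain pivot cannot place both values.
df = pd.DataFrame({
    "foo": ["one", "one", "two", "two"],
    "bar": ["A", "A", "B", "C"],
    "baz": [1, 2, 3, 4],
})

# pivot_table combines the duplicated pair with the given aggregation.
table = df.pivot_table(index="foo", columns="bar", values="baz", aggfunc="sum")
print(table)
```

The two `('one', 'A')` rows are summed into one cell, while pairs that never occur stay missing.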

Examples

>>> df = ps.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
...                            'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']},
...                   columns=['foo', 'bar', 'baz', 'zoo'])
>>> df
   foo bar  baz zoo
0  one   A    1   x
1  one   B    2   y
2  one   C    3   z
3  two   A    4   q
4  two   B    5   w
5  two   C    6   t
>>> df.pivot(index='foo', columns='bar', values='baz').sort_index()
bar  A  B  C
foo
one  1  2  3
two  4  5  6
>>> df.pivot(columns='bar', values='baz').sort_index()  
bar  A    B    C
0  1.0  NaN  NaN
1  NaN  2.0  NaN
2  NaN  NaN  3.0
3  4.0  NaN  NaN
4  NaN  5.0  NaN
5  NaN  NaN  6.0

Note that pandas raises a ValueError when duplicated values are found, whereas pandas-on-Spark’s pivot still works and uses the first value it meets during the operation. Because pivot is an expensive operation, executing permissively is preferred over failing fast when processing large data.

>>> df = ps.DataFrame({"foo": ['one', 'one', 'two', 'two'],
...                    "bar": ['A', 'A', 'B', 'C'],
...                    "baz": [1, 2, 3, 4]}, columns=['foo', 'bar', 'baz'])
>>> df
   foo bar  baz
0  one   A    1
1  one   A    2
2  two   B    3
3  two   C    4
>>> df.pivot(index='foo', columns='bar', values='baz').sort_index()
bar    A    B    C
foo
one  1.0  NaN  NaN
two  NaN  3.0  4.0
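The "first value wins" behaviour can be pictured with a small pure-Python model. This is a conceptual sketch only, not how pandas-on-Spark implements pivot: `pivot_first` is a hypothetical helper, and list order stands in for whatever order the rows are met in during the actual operation.

```python
def pivot_first(rows, index, columns, values):
    """Conceptual model: keep the first value seen per (index, column) pair."""
    col_labels = sorted({row[columns] for row in rows})
    table = {}
    seen = set()
    for row in rows:
        key = (row[index], row[columns])
        if key in seen:
            continue  # duplicated pair: keep the earlier value instead of raising
        seen.add(key)
        # Create the output row on first sight, with every column empty (None).
        cells = table.setdefault(row[index], dict.fromkeys(col_labels))
        cells[row[columns]] = row[values]
    return table


rows = [
    {"foo": "one", "bar": "A", "baz": 1},
    {"foo": "one", "bar": "A", "baz": 2},  # duplicate of ('one', 'A')
    {"foo": "two", "bar": "B", "baz": 3},
    {"foo": "two", "bar": "C", "baz": 4},
]
result = pivot_first(rows, index="foo", columns="bar", values="baz")
print(result)
```

In this model the second `('one', 'A')` row is skipped, so the cell holds 1, and `None` plays the role of NaN for pairs that never occur.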

It also supports multi-index and multi-index columns.

>>> df.columns = pd.MultiIndex.from_tuples([('a', 'foo'), ('a', 'bar'), ('b', 'baz')])

>>> df = df.set_index(('a', 'bar'), append=True)
>>> df  
              a   b
            foo baz
  (a, bar)
0 A         one   1
1 A         one   2
2 B         two   3
3 C         two   4
>>> df.pivot(columns=('a', 'foo'), values=('b', 'baz')).sort_index()
('a', 'foo')  one  two
  (a, bar)
0 A           1.0  NaN
1 A           2.0  NaN
2 B           NaN  3.0
3 C           NaN  4.0