pyspark.pandas.Series.value_counts

Series.value_counts(normalize: bool = False, sort: bool = True, ascending: bool = False, bins: None = None, dropna: bool = True) → Series

Return a Series containing counts of unique values. The resulting object is in descending order, so the first element is the most frequently occurring value. Excludes NA values by default.

Parameters
normalize : bool, default False

If True then the object returned will contain the relative frequencies of the unique values.

sort : bool, default True

Sort by values.

ascending : bool, default False

Sort in ascending order.

bins : Not Yet Supported
dropna : bool, default True

Don’t include counts of NaN.

Returns
counts : Series

See also

Series.count

Number of non-NA elements in a Series.

Examples

For Series

>>> import numpy as np
>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'x': [0, 0, 1, 1, 1, np.nan]})
>>> df.x.value_counts()  
1.0    3
0.0    2
Name: x, dtype: int64

With normalize set to True, the relative frequency of each value is returned, computed by dividing each count by the sum of counts.

>>> df.x.value_counts(normalize=True)  
1.0    0.6
0.0    0.4
Name: x, dtype: float64

dropna

With dropna set to False, we can also see NaN index values.

>>> df.x.value_counts(dropna=False)  
1.0    3
0.0    2
NaN    1
Name: x, dtype: int64
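Note that normalize and dropna interact: when dropna is False, the NaN count is included in the normalization denominator. A small sketch using plain pandas (an assumption for illustration; pyspark.pandas mirrors pandas semantics here):

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 0, 1, 1, 1, np.nan])

# With dropna=False, NaN is counted (1 of 6 values), so it enters the
# denominator and the relative frequencies still sum to 1.0.
freq = s.value_counts(normalize=True, dropna=False)
print(freq)  # 1.0 -> 0.5, 0.0 -> 1/3, NaN -> 1/6
```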

For Index

>>> idx = ps.Index([3, 1, 2, 3, 4, np.nan])
>>> idx
Float64Index([3.0, 1.0, 2.0, 3.0, 4.0, nan], dtype='float64')
>>> idx.value_counts().sort_index()
1.0    1
2.0    1
3.0    2
4.0    1
dtype: int64

sort

With sort set to False, the result is not sorted by count.

>>> idx.value_counts(sort=False).sort_index()
1.0    1
2.0    1
3.0    2
4.0    1
dtype: int64
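The ascending parameter from the signature above is not demonstrated in these examples; it reverses the sort order so the least frequent value comes first. A minimal sketch using plain pandas (an assumption for illustration; pyspark.pandas accepts the same flag):

```python
import numpy as np
import pandas as pd

s = pd.Series([0, 0, 1, 1, 1, np.nan])

# ascending=True sorts by count, smallest first: 0.0 (count 2), then 1.0 (count 3)
print(s.value_counts(ascending=True))
```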

normalize

With normalize set to True, the relative frequency of each value is returned, computed by dividing each count by the sum of counts.

>>> idx.value_counts(normalize=True).sort_index()
1.0    0.2
2.0    0.2
3.0    0.4
4.0    0.2
dtype: float64

dropna

With dropna set to False, we can also see NaN index values.

>>> idx.value_counts(dropna=False).sort_index()  
1.0    1
2.0    1
3.0    2
4.0    1
NaN    1
dtype: int64

For MultiIndex

>>> import pandas as pd
>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [1, 1, 1, 1, 1, 2, 1, 2, 2]])
>>> s = ps.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3], index=midx)
>>> s.index  
MultiIndex([(  'lama', 'weight'),
            (  'lama', 'weight'),
            (  'lama', 'weight'),
            (   'cow', 'weight'),
            (   'cow', 'weight'),
            (   'cow', 'length'),
            ('falcon', 'weight'),
            ('falcon', 'length'),
            ('falcon', 'length')],
           )
>>> s.index.value_counts().sort_index()
(cow, length)       1
(cow, weight)       2
(falcon, length)    2
(falcon, weight)    1
(lama, weight)      3
dtype: int64
>>> s.index.value_counts(normalize=True).sort_index()
(cow, length)       0.111111
(cow, weight)       0.222222
(falcon, length)    0.222222
(falcon, weight)    0.111111
(lama, weight)      0.333333
dtype: float64

If the Index has a name, the name is kept.

>>> idx = ps.Index([0, 0, 0, 1, 1, 2, 3], name='pandas-on-Spark')
>>> idx.value_counts().sort_index()
0    3
1    2
2    1
3    1
Name: pandas-on-Spark, dtype: int64