pyspark.pandas.DataFrame.rank

DataFrame.rank(method: str = 'average', ascending: bool = True, numeric_only: Optional[bool] = None) → pyspark.pandas.frame.DataFrame

Compute numerical data ranks (1 through n) along axis. By default, equal values are assigned a rank that is the average of the ranks of those values.

Note

The current implementation of rank uses Spark’s Window without specifying a partition specification. This moves all data into a single partition on a single machine and can cause serious performance degradation. Avoid this method on very large datasets.

Parameters
method : {‘average’, ‘min’, ‘max’, ‘first’, ‘dense’}
  • average: average rank of group

  • min: lowest rank in group

  • max: highest rank in group

  • first: ranks assigned in order they appear in the array

  • dense: like ‘min’, but rank always increases by 1 between groups

ascending : bool, default True

If False, rank from high (1) to low (N).

numeric_only : bool, optional

For DataFrame objects, rank only numeric columns if set to True.

Returns
ranks : same type as caller

Examples

>>> import pyspark.pandas as ps
>>> df = ps.DataFrame({'A': [1, 2, 2, 3], 'B': [4, 3, 2, 1]}, columns=['A', 'B'])
>>> df
   A  B
0  1  4
1  2  3
2  2  2
3  3  1
>>> df.rank().sort_index()
     A    B
0  1.0  4.0
1  2.5  3.0
2  2.5  2.0
3  4.0  1.0

If method is set to ‘min’, it uses the lowest rank in each group.

>>> df.rank(method='min').sort_index()
     A    B
0  1.0  4.0
1  2.0  3.0
2  2.0  2.0
3  4.0  1.0

If method is set to ‘max’, it uses the highest rank in each group.

>>> df.rank(method='max').sort_index()
     A    B
0  1.0  4.0
1  3.0  3.0
2  3.0  2.0
3  4.0  1.0

If method is set to ‘dense’, it leaves no gaps between groups.

>>> df.rank(method='dense').sort_index()
     A    B
0  1.0  4.0
1  2.0  3.0
2  2.0  2.0
3  3.0  1.0
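
The tie-handling rules shown above can be reproduced in plain Python. The following is a minimal sketch for illustration only, not the pandas-on-Spark implementation: rank_sketch is a hypothetical helper, it works on a single list of values rather than a DataFrame, and it ignores missing values.

```python
def rank_sketch(values, method="average", ascending=True):
    """Rank values 1..n, resolving ties according to `method` (illustrative only)."""
    # Positions ordered by value. sorted() is stable, so tied values keep
    # their original relative order, which is what 'first' relies on.
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=not ascending)

    ranks = [0.0] * len(values)
    start = 0   # index into `order` where the current tie group begins
    dense = 0   # number of distinct groups seen so far
    while start < len(order):
        # Extend `end` to cover every value equal to the group's first value.
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        dense += 1
        lo, hi = start + 1, end + 1  # 1-based ordinal ranks spanned by the group
        for offset, idx in enumerate(order[start:end + 1]):
            if method == "average":
                ranks[idx] = (lo + hi) / 2
            elif method == "min":
                ranks[idx] = float(lo)
            elif method == "max":
                ranks[idx] = float(hi)
            elif method == "first":
                ranks[idx] = float(lo + offset)
            elif method == "dense":
                ranks[idx] = float(dense)
            else:
                raise ValueError(f"unknown method: {method!r}")
        start = end + 1
    return ranks


a = [1, 2, 2, 3]  # column A from the examples above
print(rank_sketch(a))                    # [1.0, 2.5, 2.5, 4.0]
print(rank_sketch(a, method="first"))    # [1.0, 2.0, 3.0, 4.0]
print(rank_sketch(a, ascending=False))   # [4.0, 2.5, 2.5, 1.0]
```

With method='first', the tie between the two 2s is broken by position, so they receive ranks 2.0 and 3.0. With ascending=False, the largest value receives rank 1.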

If numeric_only is set to True, only numeric columns are ranked.

>>> df = ps.DataFrame({'A': [1, 2, 2, 3], 'B': ['a', 'b', 'd', 'c']}, columns=['A', 'B'])
>>> df
   A  B
0  1  a
1  2  b
2  2  d
3  3  c
>>> df.rank(numeric_only=True)
     A
0  1.0
1  2.5
2  2.5
3  4.0