pyspark.pandas.DataFrame.nlargest

DataFrame.nlargest(n: int, columns: Union[Any, Tuple[Any, …], List[Union[Any, Tuple[Any, …]]]], keep: str = 'first') → pyspark.pandas.frame.DataFrame

Return the first n rows ordered by columns in descending order.

Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.

This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant in pandas. In pandas-on-Spark, thanks to Spark’s lazy execution and query optimizer, the two would have the same performance.
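The equivalence stated above can be checked with a small sketch. It uses plain pandas as a stand-in (an assumption, chosen so the snippet runs without a Spark session), and the variable names are illustrative only. The data has no ties in the ordering column, so both paths select the same rows in the same order.

```python
import numpy as np
import pandas as pd

# Plain pandas stands in for pyspark.pandas here (an assumption made so the
# sketch runs without a Spark session).
df = pd.DataFrame({'X': [1, 2, 3, 5, 6, 7, np.nan],
                   'Y': [6, 7, 8, 9, 10, 11, 12]})

# Top three rows by 'X' using nlargest.
top_direct = df.nlargest(3, 'X')

# The same selection written as an explicit descending sort followed by head.
# sort_values places NaN last by default, so the NaN row is not picked.
top_sorted = df.sort_values('X', ascending=False).head(3)

# True when both frames hold the same rows, index, and dtypes.
same_rows = top_direct.equals(top_sorted)
```

When the ordering column contains ties, the two forms may order the tied rows differently, so the comparison is only meaningful on tie-free data.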

Parameters
n : int

Number of rows to return.

columns : label or list of labels

Column label(s) to order by.

keep : {'first', 'last'}, default 'first'. 'all' is not implemented yet.

Determines which duplicates (if any) to keep.

- first : Keep the first occurrence.
- last : Keep the last occurrence.

Returns
DataFrame

The first n rows ordered by the given columns in descending order.

See also

DataFrame.nsmallest

Return the first n rows ordered by columns in ascending order.

DataFrame.sort_values

Sort DataFrame by the values.

DataFrame.head

Return the first n rows without re-ordering.

Notes

This function cannot be used with all column types. For example, when specifying columns with object or category dtypes, TypeError is raised.
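The failure mode described in this note can be sketched as follows. The snippet uses plain pandas as a stand-in (an assumption); it does not verify that pandas-on-Spark raises for the same input. The frame and its column names are hypothetical.

```python
import pandas as pd

# Hypothetical frame with one object-dtype column and one numeric column.
labels_df = pd.DataFrame({'label': pd.Series(['a', 'b', 'c'], dtype=object),
                          'score': [1, 2, 3]})

# Ordering by the object-dtype column is rejected in plain pandas.
try:
    labels_df.nlargest(2, 'label')
    raised = False
except TypeError:
    raised = True

# Ordering by the numeric column works and carries 'label' along.
top_by_score = labels_df.nlargest(2, 'score')
```

If ordering by such a column is needed, sort_values followed by head is the usual alternative.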

Examples

>>> df = ps.DataFrame({'X': [1, 2, 3, 5, 6, 7, np.nan],
...                    'Y': [6, 7, 8, 9, 10, 11, 12]})
>>> df
     X   Y
0  1.0   6
1  2.0   7
2  3.0   8
3  5.0   9
4  6.0  10
5  7.0  11
6  NaN  12

In the following example, we will use nlargest to select the three rows having the largest values in column “X”.

>>> df.nlargest(n=3, columns='X')
     X   Y
5  7.0  11
4  6.0  10
3  5.0   9

To order by the largest values in column “Y” and then “X”, we can specify multiple columns like in the next example.

>>> df.nlargest(n=3, columns=['Y', 'X'])
     X   Y
6  NaN  12
5  7.0  11
4  6.0  10

The examples below show how ties are resolved, which is decided by keep.

>>> tied_df = ps.DataFrame({'X': [1, 2, 2, 3, 3]}, index=['a', 'b', 'c', 'd', 'e'])
>>> tied_df
   X
a  1
b  2
c  2
d  3
e  3

When using keep='first' (the default), ties are resolved in order:

>>> tied_df.nlargest(3, 'X')
   X
d  3
e  3
b  2
>>> tied_df.nlargest(3, 'X', keep='first')
   X
d  3
e  3
b  2

When using keep='last', ties are resolved in reverse order:

>>> tied_df.nlargest(3, 'X', keep='last')
   X
e  3
d  3
c  2