pyspark.pandas.DataFrame.nsmallest
DataFrame.nsmallest(n: int, columns: Union[Any, Tuple[Any, …], List[Union[Any, Tuple[Any, …]]]], keep: str = 'first') → pyspark.pandas.frame.DataFrame
Return the first n rows ordered by columns in ascending order.
Return the first n rows with the smallest values in columns, in ascending order. The columns that are not specified are returned as well, but not used for ordering.
In pandas, this method is equivalent to df.sort_values(columns, ascending=True).head(n), but more performant. In pandas-on-Spark, thanks to Spark’s lazy execution and query optimizer, the two have the same performance.
Parameters
n : int
    Number of items to retrieve.
columns : list or str
    Column name or names to order by.
keep : {‘first’, ‘last’}, default ‘first’
    Determines which duplicates (if any) to keep. ‘all’ is not implemented yet.
    - first : Keep the first occurrence.
    - last : Keep the last occurrence.
Returns
DataFrame
See also
DataFrame.nlargest
Return the first n rows ordered by columns in descending order.
DataFrame.sort_values
Sort DataFrame by the values.
DataFrame.head
Return the first n rows without re-ordering.
Examples
>>> df = ps.DataFrame({'X': [1, 2, 3, 5, 6, 7, np.nan],
...                    'Y': [6, 7, 8, 9, 10, 11, 12]})
>>> df
     X   Y
0  1.0   6
1  2.0   7
2  3.0   8
3  5.0   9
4  6.0  10
5  7.0  11
6  NaN  12
In the following example, we will use nsmallest to select the three rows having the smallest values in column “X”.
>>> df.nsmallest(n=3, columns='X')
     X  Y
0  1.0  6
1  2.0  7
2  3.0  8
To order by the smallest values in column “Y” and then “X”, we can specify multiple columns like in the next example.
>>> df.nsmallest(n=3, columns=['Y', 'X'])
     X  Y
0  1.0  6
1  2.0  7
2  3.0  8
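Ordering by “Y” and then “X” can be read as a lexicographic comparison: rows are compared on “Y” first, and “X” only breaks ties in “Y”. The following is a minimal pure-Python sketch of that ordering. It assumes hypothetical (X, Y) tuples rather than a real DataFrame, and it does not describe how pandas-on-Spark executes the query.

```python
# Hypothetical (X, Y) rows chosen so that "Y" contains a tie; not the df above.
rows = [(3.0, 6), (1.0, 7), (2.0, 6), (5.0, 8)]

# Compare on Y first, then on X to break ties in Y.
ordered = sorted(rows, key=lambda row: (row[1], row[0]))

# Keep the three smallest rows under that ordering.
smallest_three = ordered[:3]
print(smallest_three)
```

The two rows with Y equal to 6 come first, and the smaller “X” value decides which of them leads.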
The examples below show how ties are resolved, which is decided by keep.
>>> tied_df = ps.DataFrame({'X': [1, 1, 2, 2, 3]}, index=['a', 'b', 'c', 'd', 'e'])
>>> tied_df
   X
a  1
b  1
c  2
d  2
e  3
When using keep='first' (the default), ties are resolved in order:
>>> tied_df.nsmallest(3, 'X')
   X
a  1
b  1
c  2
>>> tied_df.nsmallest(3, 'X', keep='first')
   X
a  1
b  1
c  2
When using keep='last', ties are resolved in reverse order:
>>> tied_df.nsmallest(3, 'X', keep='last')
   X
b  1
a  1
d  2
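The relationship between sorting and then taking the head, and selecting the n smallest rows directly, can be sketched in plain Python. This is only an illustration using the standard-library heapq module and hypothetical (index, X, Y) tuples that mirror the first example. It is not how pandas or pandas-on-Spark implement nsmallest, and the NaN row is omitted because NaN does not order reliably under plain Python comparisons.

```python
import heapq

# Hypothetical rows standing in for the example data: (index, X, Y).
rows = [(0, 1.0, 6), (1, 2.0, 7), (2, 3.0, 8),
        (3, 5.0, 9), (4, 6.0, 10), (5, 7.0, 11)]


def by_x(row):
    """Ordering key: the value in the "X" position."""
    return row[1]


# Full sort, then keep the first n rows (the sort_values(...).head(n) shape).
sorted_then_head = sorted(rows, key=by_x)[:3]

# Direct selection of the n smallest rows.
top_n = heapq.nsmallest(3, rows, key=by_x)

print(sorted_then_head == top_n)
```

Both expressions select the same rows in the same order; the direct selection only has to track n candidates at a time instead of ordering every row.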