pyspark.pandas.DataFrame.dropna

DataFrame.dropna(axis: Union[int, str] = 0, how: str = 'any', thresh: Optional[int] = None, subset: Union[Any, Tuple[Any, ...], List[Union[Any, Tuple[Any, ...]]], None] = None, inplace: bool = False) → Optional[pyspark.pandas.frame.DataFrame]

Remove missing values.

Parameters

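axis : {0 or ‘index’, 1 or ‘columns’}, default 0

Determine whether rows or columns which contain missing values are removed.

0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing values, as in the axis='columns' example under Examples.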

how : {‘any’, ‘all’}, default ‘any’

Determine whether a row or column is removed when it contains at least one NA value, or only when all of its values are NA.

‘any’ : If any NA values are present, drop that row or column.
‘all’ : If all values are NA, drop that row or column.

thresh : int, optional

Require at least that many non-NA values for a row or column to be kept.

subset : array-like, optional

Labels along the other axis to consider. For example, if you are dropping rows, these would be a list of columns to include.

inplace : bool, default False

If True, perform the operation in place and return None.

Returns

DataFrame: DataFrame with NA entries dropped from it, or None if inplace is True.

See also

DataFrame.drop: Drop specified labels from columns.
DataFrame.isnull: Indicate missing values.
DataFrame.notnull: Indicate existing (non-missing) values.

Examples

>>> df = ps.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [None, 'Batmobile', 'Bullwhip'],
...                    "born": [None, "1940-04-25", None]},
...                   columns=['name', 'toy', 'born'])
>>> df
       name        toy        born
0    Alfred       None        None
1    Batman  Batmobile  1940-04-25
2  Catwoman   Bullwhip        None

Drop the rows where at least one element is missing.

>>> df.dropna()
     name        toy        born
1  Batman  Batmobile  1940-04-25

Drop the columns where at least one element is missing.

>>> df.dropna(axis='columns')
       name
0    Alfred
1    Batman
2  Catwoman

Drop the rows where all elements are missing.

>>> df.dropna(how='all')
       name        toy        born
0    Alfred       None        None
1    Batman  Batmobile  1940-04-25
2  Catwoman   Bullwhip        None

Keep only the rows with at least 2 non-NA values.

>>> df.dropna(thresh=2)
       name        toy        born
1    Batman  Batmobile  1940-04-25
2  Catwoman   Bullwhip        None

Define in which columns to look for missing values.

>>> df.dropna(subset=['name', 'born'])
     name        toy        born
1  Batman  Batmobile  1940-04-25

Keep the DataFrame with valid entries in the same variable.

>>> df.dropna(inplace=True)
>>> df
     name        toy        born
1  Batman  Batmobile  1940-04-25
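
The row-dropping rules for how, thresh, and subset described under Parameters can be sketched in plain Python. This is only an illustration of the selection logic, not how pandas-on-Spark implements dropna. The helper names keep_row and dropna_rows are hypothetical, only None is treated as missing (a simplification), and letting thresh decide on its own when it is given is an assumption of this sketch.

```python
from typing import Any, Dict, List, Optional, Sequence

Row = Dict[str, Any]


def keep_row(row: Row, how: str = "any", thresh: Optional[int] = None,
             subset: Optional[Sequence[str]] = None) -> bool:
    """Return True if the row would be kept (hypothetical helper, illustrative only)."""
    # Only the columns named in `subset` are inspected; the default is every column.
    columns = list(subset) if subset is not None else list(row)
    # Simplification: only None counts as a missing value here.
    non_na = sum(1 for c in columns if row[c] is not None)

    if thresh is not None:
        # Assumption of this sketch: when `thresh` is given, it alone decides.
        return non_na >= thresh
    if how == "any":
        # Dropped if any inspected value is missing, so only complete rows are kept.
        return non_na == len(columns)
    if how == "all":
        # Dropped only if every inspected value is missing.
        return non_na > 0
    raise ValueError(f"invalid how option: {how}")


def dropna_rows(rows: Dict[int, Row], **kwargs: Any) -> List[int]:
    """Return the index labels of the rows that are kept (hypothetical helper)."""
    return [label for label, row in rows.items() if keep_row(row, **kwargs)]


# Same data as the examples above, keyed by index label.
rows = {
    0: {"name": "Alfred", "toy": None, "born": None},
    1: {"name": "Batman", "toy": "Batmobile", "born": "1940-04-25"},
    2: {"name": "Catwoman", "toy": "Bullwhip", "born": None},
}

print(dropna_rows(rows))                           # [1]
print(dropna_rows(rows, how="all"))                # [0, 1, 2]
print(dropna_rows(rows, thresh=2))                 # [1, 2]
print(dropna_rows(rows, subset=["name", "born"]))  # [1]
```

The kept index labels match the rows shown in the corresponding examples: Alfred has only one non-NA value, so thresh=2 drops him, while Catwoman is dropped whenever the born column must be present.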
