pyspark.pandas.DataFrame.select_dtypes

DataFrame.select_dtypes(include: Union[str, List[str], None] = None, exclude: Union[str, List[str], None] = None) → pyspark.pandas.frame.DataFrame

Return a subset of the DataFrame’s columns based on the column dtypes.

Parameters
include, exclude : scalar or list-like

A selection of dtypes or strings to be included/excluded. At least one of these parameters must be supplied. Spark SQL DDL type strings are also accepted, for instance ‘string’ and ‘date’.

Returns
DataFrame

The subset of the frame including the dtypes in include and excluding the dtypes in exclude.

Raises
ValueError
  • If both include and exclude are empty

    >>> df = ps.DataFrame({'a': [1, 2] * 3,
    ...                    'b': [True, False] * 3,
    ...                    'c': [1.0, 2.0] * 3})
    >>> df.select_dtypes()
    Traceback (most recent call last):
    ...
    ValueError: at least one of include or exclude must be nonempty
    
  • If include and exclude have overlapping elements

    >>> df = ps.DataFrame({'a': [1, 2] * 3,
    ...                    'b': [True, False] * 3,
    ...                    'c': [1.0, 2.0] * 3})
    >>> df.select_dtypes(include='a', exclude='a')
    Traceback (most recent call last):
    ...
    ValueError: include and exclude overlap on {'a'}
    

Notes

  • To select datetimes, use np.datetime64, 'datetime', or 'datetime64' (see the sketch below)
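
    For example, a minimal sketch using a hypothetical 'ts' timestamp column (exact rendering may vary):

    >>> import datetime
    >>> df_dt = ps.DataFrame({'ts': [datetime.datetime(2022, 1, 1)] * 3,
    ...                       'x': [1, 2, 3]})
    >>> df_dt.select_dtypes(include='datetime64')  # doctest: +SKIP
              ts
    0 2022-01-01
    1 2022-01-01
    2 2022-01-01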

Examples

>>> df = ps.DataFrame({'a': [1, 2] * 3,
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3,
...                    'd': ['a', 'b'] * 3}, columns=['a', 'b', 'c', 'd'])
>>> df
   a      b    c  d
0  1   True  1.0  a
1  2  False  2.0  b
2  1   True  1.0  a
3  2  False  2.0  b
4  1   True  1.0  a
5  2  False  2.0  b
>>> df.select_dtypes(include='bool')
       b
0   True
1  False
2   True
3  False
4   True
5  False
>>> df.select_dtypes(include=['float64'], exclude=['int'])
     c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0
>>> df.select_dtypes(include=['int'], exclude=['float64'])
   a
0  1
1  2
2  1
3  2
4  1
5  2
>>> df.select_dtypes(exclude=['int'])
       b    c  d
0   True  1.0  a
1  False  2.0  b
2   True  1.0  a
3  False  2.0  b
4   True  1.0  a
5  False  2.0  b

Spark SQL DDL type strings can be used as well.

>>> df.select_dtypes(exclude=['string'])
   a      b    c
0  1   True  1.0
1  2  False  2.0
2  1   True  1.0
3  2  False  2.0
4  1   True  1.0
5  2  False  2.0
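
include accepts DDL type strings in the same way; a minimal sketch, keeping only the string column 'd' (exact rendering may vary):

>>> df.select_dtypes(include=['string'])  # doctest: +SKIP
   d
0  a
1  b
2  a
3  b
4  a
5  b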