pyspark.pandas.get_dummies

pyspark.pandas.get_dummies(data: Union[pyspark.pandas.frame.DataFrame, pyspark.pandas.series.Series], prefix: Union[str, List[str], Dict[str, str], None] = None, prefix_sep: str = '_', dummy_na: bool = False, columns: Union[Any, Tuple[Any, …], List[Union[Any, Tuple[Any, …]]], None] = None, sparse: bool = False, drop_first: bool = False, dtype: Union[str, numpy.dtype, pandas.core.dtypes.base.ExtensionDtype, None] = None) → pyspark.pandas.frame.DataFrame

Convert a categorical variable into dummy/indicator variables, also known as one-hot encoding.

Parameters
data : array-like, Series, or DataFrame
prefix : string, list of strings, or dict of strings, default None

String to prepend to DataFrame column names. Pass a list with length equal to the number of columns being encoded when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

prefix_sep : string, default ‘_’

Separator/delimiter to use when a prefix is added. Alternatively, pass a list or dictionary, as with prefix.

dummy_na : bool, default False

Add a column to indicate NaNs. If False, NaNs are ignored.

columns : list-like, default None

Column names in the DataFrame to be encoded. If columns is None then all the columns with object or category dtype will be converted.

sparse : bool, default False

Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False). In pandas-on-Spark, this value must be False.

drop_first : bool, default False

Whether to get k-1 dummies out of k categorical levels by removing the first level.

dtype : dtype, default np.uint8

Data type for new columns. Only a single dtype is allowed.

Returns
dummies : DataFrame

Examples

>>> s = ps.Series(list('abca'))
>>> ps.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
>>> df = ps.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
...                    'C': [1, 2, 3]},
...                   columns=['A', 'B', 'C'])
>>> ps.get_dummies(df, prefix=['col1', 'col2'])
   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1       1       0       0       1       0
1  2       0       1       1       0       0
2  3       1       0       0       0       1
>>> ps.get_dummies(ps.Series(list('abcaa')))
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
4  1  0  0
>>> ps.get_dummies(ps.Series(list('abcaa')), drop_first=True)
   b  c
0  0  0
1  1  0
2  0  1
3  0  0
4  0  0
>>> ps.get_dummies(ps.Series(list('abc')), dtype=float)
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
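
To make the roles of prefix, prefix_sep, and drop_first concrete, the following is a minimal plain-Python sketch of the encoding rule that the examples above display. It is an illustration only, not how pandas-on-Spark computes the result: the function name `one_hot_sketch` is hypothetical, it handles a single column and a single string prefix, it treats `None` as the missing value, and it returns a dict of lists rather than a DataFrame.

```python
from typing import Dict, Hashable, List, Optional, Sequence


def one_hot_sketch(
    values: Sequence[Optional[Hashable]],
    prefix: Optional[str] = None,
    prefix_sep: str = "_",
    drop_first: bool = False,
) -> Dict[str, List[int]]:
    """Hypothetical helper: one indicator column per distinct level."""
    # Distinct non-missing levels, sorted to match the column order
    # shown in the examples above (a, b, c).
    levels = sorted({v for v in values if v is not None})
    if drop_first:
        # k levels become k-1 columns; the first level is represented
        # implicitly by a row of all zeros.
        levels = levels[1:]
    columns: Dict[str, List[int]] = {}
    for level in levels:
        # Without a prefix the level itself is the column name;
        # with one, the name is prefix + prefix_sep + level.
        name = str(level) if prefix is None else f"{prefix}{prefix_sep}{level}"
        columns[name] = [1 if v == level else 0 for v in values]
    return columns


encoded = one_hot_sketch(list("abcaa"), drop_first=True)
print(encoded)

prefixed = one_hot_sketch(["a", "b", "a"], prefix="col1")
print(prefixed)
```

The first call reproduces the `b` and `c` columns of the drop_first example, and the second reproduces the `col1_a` and `col1_b` columns of the prefix example.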