pyspark.sql.DataFrame.repartition

DataFrame.repartition(numPartitions: Union[int, ColumnOrName], *cols: ColumnOrName) → DataFrame

Returns a new DataFrame partitioned by the given partitioning expressions. The resulting DataFrame is hash partitioned.

Parameters
numPartitionsint

can be an int to specify the target number of partitions or a Column. If it is a Column, it will be used as the first partitioning column. If not specified, the default number of partitions is used.

colsstr or Column

partitioning columns.

Added optional arguments to specify the partitioning columns. Also made numPartitions optional if partitioning columns are specified.

Examples

>>> df.repartition(10).rdd.getNumPartitions()
10
>>> data = df.union(df).repartition("age")
>>> data.show()
+---+-----+
|age| name|
+---+-----+
|  2|Alice|
|  5|  Bob|
|  2|Alice|
|  5|  Bob|
+---+-----+
>>> data = data.repartition(7, "age")
>>> data.show()
+---+-----+
|age| name|
+---+-----+
|  2|Alice|
|  5|  Bob|
|  2|Alice|
|  5|  Bob|
+---+-----+
>>> data.rdd.getNumPartitions()
7
>>> data = data.repartition(3, "name", "age")
>>> data.show()
+---+-----+
|age| name|
+---+-----+
|  5|  Bob|
|  5|  Bob|
|  2|Alice|
|  2|Alice|
+---+-----+