pyspark.sql.DataFrameWriter.bucketBy

DataFrameWriter.bucketBy(numBuckets: int, col: Union[str, List[str], Tuple[str, ...]], *cols: Optional[str]) → pyspark.sql.readwriter.DataFrameWriter

Buckets the output by the given columns. If specified, the output is laid out on the file system in a way similar to Hive's bucketing scheme, but with a different bucket hash function, and it is not compatible with Hive's bucketing.
Parameters
numBuckets : int
    the number of buckets to save
col : str, list or tuple
    a name of a column, or a list of column names.
cols : str
    additional column names (optional). If col is a list or tuple, this should be empty.
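Both calling conventions implied by the signature behave the same; a minimal sketch (df is assumed to be an existing DataFrame with year and month columns):

>>> writer = df.write.bucketBy(100, 'year', 'month')    # column names passed as varargs
>>> writer = df.write.bucketBy(100, ['year', 'month'])  # a single list, with no extra cols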
Notes
Applicable for file-based data sources in combination with DataFrameWriter.saveAsTable().

Examples
>>> (df.write.format('parquet')
...     .bucketBy(100, 'year', 'month')
...     .mode("overwrite")
...     .saveAsTable('bucketed_table'))
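A slightly fuller sketch (assuming an active SparkSession named spark and the same df as above) that pairs bucketBy() with DataFrameWriter.sortBy() to sort rows within each bucket, then reads the table back by name:

>>> (df.write.format('parquet')
...     .bucketBy(4, 'year', 'month')
...     .sortBy('month')
...     .mode("overwrite")
...     .saveAsTable('bucketed_sorted_table'))
>>> bucketed = spark.table('bucketed_sorted_table')  # read the bucketed table back

Note that sortBy() is only valid together with bucketBy(), and, like bucketBy(), it takes effect only when the write goes through saveAsTable().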