pyspark.sql.DataFrameWriter.partitionBy

DataFrameWriter.partitionBy(*cols)

Partitions the output by the given columns on the file system.

If specified, the output is laid out on the file system in a manner similar to Hive's partitioning scheme.
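For example, partitioning by a name column creates one subdirectory per distinct value under the output path, such as name=Hyukjin Kwon in the example below.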

New in version 1.4.0.

Changed in version 3.4.0: Supports Spark Connect.

Parameters
cols : str or list

name of the column, or a list of column names, to partition the output by

Examples

Write a DataFrame into partitioned Parquet files, and read them back.

>>> import tempfile
>>> import os
>>> with tempfile.TemporaryDirectory(prefix="partitionBy") as d:
...     # Write a DataFrame as Parquet files partitioned by the "name" column.
...     spark.createDataFrame(
...         [{"age": 100, "name": "Hyukjin Kwon"}, {"age": 120, "name": "Ruifeng Zheng"}]
...     ).write.partitionBy("name").mode("overwrite").format("parquet").save(d)
...
...     # Read the partitioned Parquet data back as a DataFrame.
...     spark.read.parquet(d).sort("age").show()
...
...     # Read one partition as a DataFrame.
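...     # The partition column "name" is encoded in the directory path
...     # rather than stored in the data files, so it does not appear
...     # in the result (see the basePath example below for how to
...     # keep it).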
...     spark.read.parquet(f"{d}{os.path.sep}name=Hyukjin Kwon").show()
+---+-------------+
|age|         name|
+---+-------------+
|100| Hyukjin Kwon|
|120|Ruifeng Zheng|
+---+-------------+
+---+
|age|
+---+
|100|
+---+
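
Partition by multiple columns, and read one partition back together with its partition values. This is a minimal sketch, assuming the same spark session and imports as above; the country, year, and value columns are illustrative. Setting the basePath option tells the reader where partition discovery starts, so the partition columns are kept in the result.

>>> with tempfile.TemporaryDirectory(prefix="partitionBy") as d:
...     # Each partition column adds one directory level, e.g.
...     # <path>/country=KR/year=2024/.
...     spark.createDataFrame(
...         [("KR", 2024, 1), ("US", 2024, 2)], ["country", "year", "value"]
...     ).write.partitionBy("country", "year").mode("overwrite").format("parquet").save(d)
...
...     # With basePath set to the root, partition discovery parses the
...     # path components, and the partition columns are appended after
...     # the data columns in the result.
...     spark.read.option("basePath", d).parquet(
...         f"{d}{os.path.sep}country=KR"
...     ).show()
+-----+-------+----+
|value|country|year|
+-----+-------+----+
|    1|     KR|2024|
+-----+-------+----+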