pyspark.sql.DataFrameWriter.partitionBy
- DataFrameWriter.partitionBy(*cols)
Partitions the output by the given columns on the file system.
If specified, the output is laid out on the file system similar to Hive's partitioning scheme.
New in version 1.4.0.
Changed in version 3.4.0: Supports Spark Connect.
- Parameters
- cols : str or list
names of the columns to partition the output by
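The columns may be passed either as separate string arguments or as a single list; a minimal sketch, assuming an active spark session and a hypothetical DataFrame df with year and month columns:

>>> # Equivalent calls: partition first by year, then by month.
>>> df.write.partitionBy("year", "month").parquet("/tmp/out")  # doctest: +SKIP
>>> df.write.partitionBy(["year", "month"]).parquet("/tmp/out")  # doctest: +SKIP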
Examples
Write a DataFrame into a Parquet file in a partitioned manner, and read it back.
>>> import tempfile
>>> import os
>>> with tempfile.TemporaryDirectory(prefix="partitionBy") as d:
...     # Write a DataFrame into a Parquet file in a partitioned manner.
...     spark.createDataFrame(
...         [{"age": 100, "name": "Hyukjin Kwon"}, {"age": 120, "name": "Ruifeng Zheng"}]
...     ).write.partitionBy("name").mode("overwrite").format("parquet").save(d)
...
...     # Read the Parquet file as a DataFrame.
...     spark.read.parquet(d).sort("age").show()
...
...     # Read one partition as a DataFrame.
...     spark.read.parquet(f"{d}{os.path.sep}name=Hyukjin Kwon").show()
+---+-------------+
|age|         name|
+---+-------------+
|100| Hyukjin Kwon|
|120|Ruifeng Zheng|
+---+-------------+
+---+
|age|
+---+
|100|
+---+
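Note that reading a single partition directory, as above, omits the partition column (name) because its value is encoded in the directory path rather than in the data files. Partitioning by multiple columns nests one directory level per column; a minimal sketch of the resulting layout (the year and month columns and values here are illustrative, not part of the example above):

>>> import tempfile
>>> import os
>>> with tempfile.TemporaryDirectory(prefix="partitionBy") as d:
...     # Partition by two columns: one directory level per column.
...     spark.createDataFrame(
...         [{"year": 2024, "month": 1, "v": 1}, {"year": 2024, "month": 2, "v": 2}]
...     ).write.partitionBy("year", "month").mode("overwrite").parquet(d)
...
...     # Hive-style layout: year=2024/month=1 and year=2024/month=2.
...     print(sorted(os.listdir(os.path.join(d, "year=2024"))))
['month=1', 'month=2']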