Persisting a pandas DataFrame to disk in the efficient, columnar Parquet file format using the DataFrame.to_parquet() method and an underlying engine such as PyArrow or fastparquet.
Writing a Parquet file from a pandas DataFrame
The Parquet format is the de-facto standard for columnar data storage in modern data platforms. Learning how to efficiently write Parquet files from pandas lets you move data out of Python notebooks and into the broader analytics ecosystem with minimal friction.
Parquet is an open-source, column-oriented file format designed for efficient data compression and encoding. It stores data by column rather than by row, enabling high-performance analytical workloads where only a subset of columns is required. Major processing engines—including Apache Spark, Trino, Hive, DuckDB, Snowflake, and BigQuery—natively read and write Parquet, making it an excellent interchange format for data engineers.
DataFrame.to_parquet()
The to_parquet() method is available on any pandas DataFrame. Under the hood, pandas delegates the work to an engine, either pyarrow (recommended) or fastparquet.
import pandas as pd
import pyarrow  # imported only to fail fast if the PyArrow engine is missing

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "event_date": pd.to_datetime(["2024-06-10", "2024-06-11", "2024-06-12"]),
    "purchase_amt": [23.99, 79.50, 12.00],
})

df.to_parquet("events.parquet", engine="pyarrow")
PyArrow supports the full range of Parquet features (nested types, dictionary encoding, statistics), and with the default engine="auto" pandas prefers it whenever it is installed. fastparquet can be faster for narrow, numeric data but lags in feature parity. Specify an engine explicitly if you have both installed:
df.to_parquet("events.parquet", engine="fastparquet")
Parquet supports multiple codecs. Use compression="snappy" (default), "gzip", "brotli", or None:
df.to_parquet("events.parquet", compression="brotli", compression_level=7)
Row groups partition the data inside a Parquet file. A good rule of thumb is one row group per 64–128 MiB of uncompressed data. to_parquet() has no dedicated row-group parameter, but with the PyArrow engine you can forward row_group_size as an extra keyword argument, or build a pyarrow.Table and call the PyArrow writer yourself for full control.
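A minimal sketch of both routes, assuming the PyArrow engine is installed; the file name, data, and row-group size are illustrative:

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "user_id": np.arange(1_000_000),
    "amount": np.random.default_rng(0).random(1_000_000),
})

# Route 1: forward row_group_size through to_parquet to the PyArrow writer.
df.to_parquet("events_rg.parquet", engine="pyarrow", row_group_size=250_000)

# Route 2: convert to an Arrow table and call the Parquet writer directly.
table = pa.Table.from_pandas(df, preserve_index=False)
pq.write_table(table, "events_rg.parquet", row_group_size=250_000, compression="snappy")

# 1,000,000 rows / 250,000 rows per group -> 4 row groups
print(pq.ParquetFile("events_rg.parquet").metadata.num_row_groups)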
When working with large DataFrames, write a directory of Parquet files partitioned by a column:
df.to_parquet(
    "s3://analytics/events/",   # s3fs must be installed so pandas can resolve s3:// paths
    partition_cols=["event_date"],
    engine="pyarrow",
)
The call creates a Hive-style layout such as event_date=2024-06-10/part-0.parquet. Spark, DuckDB, and other engines automatically discover the partitions and push predicates (WHERE event_date = '2024-06-10') down to the storage layer, reading only the necessary files.
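To illustrate the partition pruning, here is a small sketch that queries such a layout with DuckDB's Python client; the local path events_parquet/ and the column names are assumptions for the example:

import duckdb

# Only files under event_date=2024-06-10/ are scanned; other partitions are pruned.
con = duckdb.connect()
result = con.execute(
    """
    SELECT user_id, purchase_amt
    FROM read_parquet('events_parquet/**/*.parquet', hive_partitioning = true)
    WHERE event_date = '2024-06-10'
    """
).df()
print(result.head())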
Pandas derives the Parquet column types from the DataFrame's dtypes. int64, float64, string, bool, category, datetime64[ns], and their nullable variants all map cleanly to Parquet. Beware of columns holding mixed Python objects (dtype object): cast them, for example with .astype("string"), before writing.
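As a quick illustration, a hypothetical free-text column that pandas loaded as object can be cast explicitly before the write:

import pandas as pd

# Hypothetical DataFrame with a mixed-object column
df = pd.DataFrame({"user_id": [1, 2], "notes": ["ok", None]})
print(df.dtypes)  # notes shows up as dtype 'object'

df["notes"] = df["notes"].astype("string")  # dedicated string dtype maps cleanly to Parquet
df.to_parquet("events_notes.parquet", engine="pyarrow")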
Putting it together, an end-to-end example:

import pandas as pd

# 1. Load raw data
raw_df = pd.read_csv("raw_events.csv", parse_dates=["event_time"])

# 2. Cast columns to optimized dtypes and derive the partition column
raw_df["event_type"] = raw_df["event_type"].astype("category")
raw_df["user_id"] = pd.to_numeric(raw_df["user_id"], downcast="unsigned")
raw_df["event_date"] = raw_df["event_time"].dt.date

# 3. Write partitioned Parquet with brotli compression
raw_df.to_parquet(
    "events_parquet/",
    partition_cols=["event_date"],
    compression="brotli",
    compression_level=6,
    engine="pyarrow",
)

# 4. Validate the round trip
reloaded = pd.read_parquet("events_parquet/", engine="pyarrow")
assert len(reloaded) == len(raw_df)
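Beyond a row count, the file footer can be inspected with PyArrow to confirm the schema and row-group layout; a short sketch against the single-file events.parquet written earlier (assuming it is still on disk):

import pyarrow.parquet as pq

pf = pq.ParquetFile("events.parquet")
print(pf.schema_arrow)                                  # Arrow schema stored in the footer
print(pf.metadata.num_rows, pf.metadata.num_row_groups)
print(pf.metadata.row_group(0).column(0).compression)   # codec used for the first column chunk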
A few best practices and pitfalls are worth calling out:

- Prefer pandas' nullable dtypes: Int64, Float64, and boolean preserve NaN/None values without falling back to object.
- Match the codec to the access pattern: snappy is fast; brotli or zstd provide higher ratios for cold data.
- Don't write object columns directly: they become inefficient binary blobs. Cast to string, category, or proper numeric types first.
- For multi-file datasets that must share a schema, consider pyarrow.dataset.write_dataset with schema enforcement (see the sketch below).

Once data lands in Parquet, Spark's spark.read.parquet() or DuckDB's READ_PARQUET can query it instantly, enabling large-scale analytics without re-ingestion.
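For the pyarrow.dataset route mentioned above, a minimal sketch might look like the following; the schema, column names, and output path are assumptions for illustration, not part of the pipeline above:

import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical schema that every file in the dataset must follow.
schema = pa.schema([
    ("user_id", pa.int64()),
    ("event_date", pa.string()),
    ("purchase_amt", pa.float64()),
])

table = pa.table(
    {
        "user_id": [1, 2, 3],
        "event_date": ["2024-06-10", "2024-06-11", "2024-06-12"],
        "purchase_amt": [23.99, 79.50, 12.00],
    },
    schema=schema,
)

ds.write_dataset(
    table,
    "events_dataset/",
    format="parquet",
    partitioning=["event_date"],
    partitioning_flavor="hive",
    existing_data_behavior="overwrite_or_ignore",
)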
While Galaxy itself is a SQL editor (not a storage engine), teams often bulk-load Parquet files into analytics warehouses like Snowflake or BigQuery. After loading, you can connect Galaxy to the warehouse and query the data with AI-assisted SQL, share results, and collaborate through Collections.
Persisting pandas DataFrames to Parquet is a low-effort, high-impact technique that unlocks interoperability, performance, and cost savings across your analytics stack. By following the best practices above, you’ll avoid common pitfalls and set your data up for success in any downstream tool—from Spark clusters to Galaxy’s modern SQL editor.
Parquet is the universal columnar storage format for modern analytics. Knowing how to export pandas DataFrames to Parquet lets data engineers move data efficiently from Python into scalable processing engines like Spark, BigQuery, and Snowflake—all of which can be queried through tools such as Galaxy. Efficient Parquet writing reduces storage costs, accelerates query performance via column pruning and compression, and enables schema enforcement across heterogeneous systems.
Install either pyarrow (recommended) or fastparquet. After installation, DataFrame.to_parquet() will detect the engine automatically.
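For example, either engine can be installed from PyPI:

pip install pyarrow        # recommended
pip install fastparquet    # alternative engine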
snappy is the default and offers balanced speed/size. Use brotli or zstd for better compression ratios on cold data where write speed is less critical.
Write new Parquet files to the same directory with matching schema. Use a partitioning scheme (e.g., by date) so new files sit in their own sub-directories, avoiding costly rewrites of existing data.
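A sketch of such an append, assuming the events_parquet/ layout from the example above and a hypothetical next-day batch (column names illustrative):

import pandas as pd

# Hypothetical new batch with the same schema as the existing dataset
new_day_df = pd.DataFrame({
    "user_id": [4, 5],
    "event_date": ["2024-06-13", "2024-06-13"],
    "purchase_amt": [5.25, 18.40],
})

# Writes additional files under events_parquet/event_date=2024-06-13/
# alongside the data already in the directory.
new_day_df.to_parquet(
    "events_parquet/",
    partition_cols=["event_date"],
    engine="pyarrow",
)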
Galaxy is a SQL editor, not a compute engine. Load Parquet files into a warehouse like Snowflake, BigQuery, or a DuckDB server, connect Galaxy to that database, and then run SQL queries with Galaxy’s AI copilot and collaboration features.