Persist a pandas DataFrame to disk or cloud in the efficient, column-oriented Apache Parquet format using DataFrame.to_parquet().
Persisting analytics datasets in Parquet is a must-have skill for modern data engineers. It combines compact files, columnar encoding, and optional compression to slash I/O costs and speed up downstream query engines such as DuckDB, BigQuery, Redshift Spectrum, or Spark.
Parquet is a binary, column-oriented storage format created by Twitter and Cloudera and standardized in the Apache ecosystem. Compared with CSV or JSON, it stores data far more compactly and supports optional compression with snappy, gzip, or zstd.
pandas delegates Parquet I/O to either pyarrow or fastparquet. pyarrow is the de-facto standard and supports the full Parquet spec.
pip install pandas pyarrow --upgrade
df.to_parquet(
    path,                   # str, Path, or buffer
    engine="pyarrow",       # or "auto" (default) / "fastparquet"
    compression="snappy",   # snappy, gzip, brotli, zstd, None
    index=False,            # omit the pandas index unless you need it
)
import pandas as pd
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")
df.to_parquet("tips.parquet", compression="zstd", index=False)
import s3fs  # must be installed; pandas hands s3:// paths to it via fsspec

df.to_parquet(
    "s3://analytics-demo/tables/tips/partition_date=2024-06-09/tips.parquet",
    storage_options={"key": "AWS_KEY", "secret": "AWS_SECRET"},
    compression="snappy",
    index=False,
)
For large datasets you rarely write a single file. Instead, you build a dataset composed of many Parquet files organized by partition columns.
import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_pandas(df, preserve_index=False)

ds.write_dataset(
    data=table,
    base_dir="s3://analytics-demo/tables/tips",
    format="parquet",
    partitioning=["day", "restaurant_id"],   # partition columns must exist in the table
    partitioning_flavor="hive",              # hive-style directories, e.g. day=.../restaurant_id=.../
    file_options=ds.ParquetFileFormat().make_write_options(compression="snappy"),
)
This layout lets engines like Amazon Athena or BigQuery skip entire folders when you filter by day or restaurant_id.
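The pruning works on the read side too. A minimal sketch, assuming the dataset written above with hive-style directory names: pyarrow.dataset only scans the folders whose partition values match the filter.

import pyarrow.dataset as ds

# Open the partitioned dataset; "hive" tells Arrow to parse day=.../ directories
dataset = ds.dataset(
    "s3://analytics-demo/tables/tips",
    format="parquet",
    partitioning="hive",
)

# Only folders where day == "Sun" are touched
sunday = dataset.to_table(filter=ds.field("day") == "Sun").to_pandas()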
snappy is fast but less compact; zstd strikes a good balance; gzip maximizes compression but costs CPU. Benchmark on your workload.
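A benchmark can be as simple as writing the same frame once per codec and comparing file sizes (and timing the writes if CPU cost matters). A sketch using the tips DataFrame from earlier; the file names are illustrative.

import os

# Write the same DataFrame with each codec and compare the resulting file sizes
for codec in ["snappy", "gzip", "zstd", None]:
    path = f"tips_{codec or 'uncompressed'}.parquet"
    df.to_parquet(path, compression=codec, index=False)
    print(f"{codec or 'none':>6}: {os.path.getsize(path):,} bytes")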
Large row groups improve scan throughput but hinder parallelism. Tune them with row_group_size (forwarded to pyarrow.parquet.write_table by the pyarrow engine) or max_rows_per_group / max_rows_per_file in pyarrow.dataset.write_dataset().
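With the pyarrow engine, extra keyword arguments to DataFrame.to_parquet() are passed through to pyarrow.parquet.write_table, so row-group size can be set in the same call (the value below is illustrative):

# Extra kwargs are forwarded to pyarrow.parquet.write_table
df.to_parquet(
    "tips_rowgroups.parquet",
    engine="pyarrow",
    index=False,
    row_group_size=100_000,   # rows per row group
)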
Convert object columns to string[pyarrow], category, or numeric dtypes to avoid wasted space and schema ambiguity.
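For example, a few of the tips columns tightened before writing (the target dtypes are one reasonable choice, not the only one):

# Arrow-backed strings, categories, and small integers encode far more
# compactly than generic object columns
df["day"] = df["day"].astype("category")
df["sex"] = df["sex"].astype("string[pyarrow]")
df["size"] = df["size"].astype("int8")
df.to_parquet("tips_typed.parquet", compression="zstd", index=False)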
Mistake: Calling df.to_parquet() without pyarrow or fastparquet installed.
Fix: pip install pyarrow
Mistake: Letting the pandas index be written (index=True, or the default with a non-trivial index), which adds an extra column such as __index_level_0__.
Fix: Always set index=False unless the index is meaningful.
Mistake: Writing different partitions with mismatched dtypes (e.g. string vs int). Query engines will refuse to read.
Fix: Enforce a single schema up front (Arrow schema or pandas.dtypes) and validate before writes.
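One lightweight way to do this, a sketch in which the expected schema and helper name are illustrative: keep a single Arrow schema next to your pipeline and refuse to write any batch whose converted schema differs.

import pandas as pd
import pyarrow as pa

# The one schema every partition must conform to (field set is illustrative)
EXPECTED = pa.schema([
    ("user_id", pa.int64()),
    ("revenue", pa.float64()),
    ("date", pa.string()),
])

def to_validated_table(frame: pd.DataFrame) -> pa.Table:
    """Convert to Arrow and fail loudly if dtypes have drifted."""
    table = pa.Table.from_pandas(frame[EXPECTED.names], preserve_index=False)
    if not table.schema.equals(EXPECTED):
        raise ValueError(
            f"Schema drift detected:\n{table.schema}\nexpected:\n{EXPECTED}"
        )
    return table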
"""Write a 1 million-row DataFrame to partitioned Parquet on local disk"""
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.dataset as ds
# Generate synthetic data
n = 1_000_000
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "user_id": rng.integers(1, 10_000, n, dtype=np.int32),
    "event_date": pd.to_datetime(rng.integers(1_704_000_000, 1_705_000_000, n), unit="s"),
    "revenue": (rng.random(n) * 25).astype(np.float32),  # match the float32 field in the schema below
})
# Add a partition column (yyyy-mm-dd)
df["date"] = df["event_date"].dt.date.astype("string")
# Arrow table for efficient writing
schema = pa.schema([
    ("user_id", pa.int32()),
    ("event_date", pa.timestamp("ms")),
    ("revenue", pa.float32()),
    ("date", pa.string()),
])
table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)
ds.write_dataset(
    table,
    base_dir="./events_parquet",
    format="parquet",
    partitioning=["date"],
    partitioning_flavor="hive",   # directories like date=2024-01-05/
    file_options=ds.ParquetFileFormat().make_write_options(compression="zstd"),
    max_rows_per_file=250_000,
    max_rows_per_group=250_000,   # must not exceed max_rows_per_file
)
While Galaxy is first and foremost an AI-powered SQL editor, many users query Parquet via external tables (e.g. in DuckDB, Trino, or BigQuery). Storing your pandas data in Parquet ensures those queries run fast and cheaply inside Galaxy’s editor. In the future, Galaxy’s catalog and visualization features will auto-detect Parquet datasets you publish to cloud storage, simplifying discovery and collaboration.
Install pyarrow and call DataFrame.to_parquet() with index=False. Use pyarrow.dataset.write_dataset() to create partitioned layouts.
Parquet is the default storage layer for modern lakehouses and query engines. Writing DataFrames directly to Parquet lets data engineers share datasets that load orders of magnitude faster than CSV, reduce storage costs, and enable advanced features like partition pruning and schema evolution.
Install pyarrow (preferred) or fastparquet. pandas will automatically detect the engine, but explicitly passing engine="pyarrow" guarantees full feature support.
Set index=False in DataFrame.to_parquet(), or preserve_index=False in pyarrow.Table.from_pandas() when working with Arrow tables. Otherwise you’ll see an extra column named __index_level_0__.
Galaxy’s SQL editor can connect to engines like DuckDB or BigQuery that read Parquet natively. Store your pandas output in Parquet, register it as an external table, and query it interactively inside Galaxy.
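As a concrete sketch (assuming the duckdb Python package and the local ./events_parquet directory written in the example above):

import duckdb

# DuckDB reads hive-partitioned Parquet directories natively
con = duckdb.connect()
daily = con.execute(
    """
    SELECT date, COUNT(*) AS events, SUM(revenue) AS revenue
    FROM read_parquet('events_parquet/*/*.parquet', hive_partitioning = true)
    GROUP BY date
    ORDER BY date
    """
).df()
print(daily)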
Partition on low-cardinality, high-value filters such as date, region, or customer_tier. Avoid high-cardinality columns like user_id, which create millions of tiny folders.
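A quick way to sanity-check candidate partition columns is to count distinct values before committing to a layout; a sketch using the tips file written earlier as a stand-in:

import pandas as pd

tips = pd.read_parquet("tips.parquet")

# Single- or double-digit counts partition well; hundreds or more do not
for col in ["day", "time", "smoker", "total_bill"]:
    print(f"{col}: {tips[col].nunique()} distinct values")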