Write Parquet Files from pandas DataFrames

Galaxy Glossary

How do I write a Parquet file from a pandas DataFrame?

Persist a pandas DataFrame to disk or cloud in the efficient, column-oriented Apache Parquet format using DataFrame.to_parquet().


Description

Persisting analytics datasets in Parquet is a must-have skill for modern data engineers. Columnar encoding and built-in compression produce compact files that slash I/O costs and speed up downstream query engines such as DuckDB, BigQuery, Redshift Spectrum, and Spark.

Why write Parquet from pandas?

Parquet is a binary, column-oriented storage format created by Twitter and Cloudera and standardized in the Apache ecosystem. Compared with CSV or JSON it offers:

  • Compression – typically 5–10× smaller files with codecs like snappy, gzip, or zstd (a quick size check follows this list).
  • Predicate push-down – engines can skip pages/row-groups that do not match a filter.
  • Schema evolution – add or drop columns without rewriting the whole dataset.
  • Interoperability – readable in Python, Java, C++, Rust, and every major lakehouse engine.
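
If you want to verify the compression claim on your own data, a minimal size comparison is enough; this sketch assumes the seaborn tips dataset used later in the article, but any DataFrame works.

import os

import pandas as pd

# Load a small sample dataset (any DataFrame will do)
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

# Write the same data as CSV and as zstd-compressed Parquet
df.to_csv("tips.csv", index=False)
df.to_parquet("tips_zstd.parquet", compression="zstd", index=False)

# Compare on-disk sizes; Parquet is usually several times smaller
print("CSV bytes:    ", os.path.getsize("tips.csv"))
print("Parquet bytes:", os.path.getsize("tips_zstd.parquet"))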

Prerequisites

pandas delegates Parquet I/O to either pyarrow or fastparquet. pyarrow is the de facto standard and supports the full Parquet spec.

pip install pandas pyarrow --upgrade

Basic syntax

df.to_parquet(
    path,                   # str, Path, or file-like buffer
    engine="pyarrow",       # or "auto" (default) / "fastparquet"
    compression="snappy",   # snappy (default), gzip, brotli, zstd, or None
    index=False,            # omit the pandas index unless you need it
)

Step-by-step example

1 · Create or load a DataFrame

import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv")

2 · Write a local Parquet file

df.to_parquet("tips.parquet", compression="zstd", index=False)

3 · Write directly to S3 (or any fsspec-compatible cloud)

import s3fs  # must be installed; pandas resolves s3:// paths through it via fsspec

df.to_parquet(
    "s3://analytics-demo/tables/tips/partition_date=2024-06-09/tips.parquet",
    storage_options={"key": "AWS_KEY", "secret": "AWS_SECRET"},
    compression="snappy",
    index=False,
)

Partitioned datasets

For large datasets you rarely write a single file. Instead, you build a dataset composed of many Parquet files organized by partition columns.

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.Table.from_pandas(df, preserve_index=False)

ds.write_dataset(
    data=table,
    base_dir="s3://analytics-demo/tables/tips",
    format="parquet",
    partitioning=["day", "time"],   # hive-style directories like day=Sun/time=Dinner/
    partitioning_flavor="hive",
    file_options=ds.ParquetFileFormat().make_write_options(compression="snappy"),
)

This layout lets engines like Amazon Athena or BigQuery skip entire folders when you filter by day or time.
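
You can observe the pruning from the Python side as well: reading the same layout back through pyarrow.dataset with a filter only touches the matching directories. The bucket path below is the illustrative one from the write above.

import pyarrow.dataset as ds

# Point at the partitioned, hive-style layout written above
dataset = ds.dataset(
    "s3://analytics-demo/tables/tips",
    format="parquet",
    partitioning="hive",
)

# Only files under day=Sun/ are scanned; every other partition is skipped
sunday = dataset.to_table(filter=ds.field("day") == "Sun").to_pandas()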

Best practices

Choose the right compression codec

snappy is fast but less compact; zstd strikes a good balance; gzip maximizes compression but costs CPU. Benchmark on your workload.
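
A minimal benchmark can look like the sketch below; it reuses the tips DataFrame from the earlier steps, and the codec list and file names are only illustrative.

import os
import time

# Compare write time and on-disk size for a few common codecs
for codec in ["snappy", "zstd", "gzip"]:
    path = f"tips_{codec}.parquet"
    start = time.perf_counter()
    df.to_parquet(path, compression=codec, index=False)
    elapsed = time.perf_counter() - start
    print(f"{codec:>6}: {os.path.getsize(path):>8} bytes in {elapsed:.3f}s")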

Keep row-groups between 64–512 MB

Large row-groups improve scan throughput but hinder parallelism and increase memory pressure. With the pyarrow engine, pass row_group_size (forwarded to pyarrow.parquet.write_table); with pyarrow.dataset.write_dataset, use max_rows_per_group / max_rows_per_file; with fastparquet, use row_group_offsets.
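
With the pyarrow engine, extra keyword arguments to DataFrame.to_parquet are forwarded to pyarrow.parquet.write_table, so row_group_size can be set directly; the row count below is a placeholder to tune until each row-group lands in the target size range for your column widths.

# Extra kwargs are forwarded to pyarrow.parquet.write_table
df.to_parquet(
    "tips_rowgroups.parquet",
    engine="pyarrow",
    compression="zstd",
    index=False,
    row_group_size=500_000,  # placeholder; tune toward 64-512 MB per row-group
)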

Explicit data types

Convert object columns to string[pyarrow], category, or numeric dtypes to avoid wasted space and schema ambiguity.
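
As an example, tightening dtypes on the tips DataFrame before the write keeps the Parquet schema explicit and compact; the mapping below simply reflects the columns of that dataset.

# Explicit dtypes before writing: dictionary-encoded strings, narrow numerics
df = df.astype({
    "sex": "category",
    "smoker": "category",
    "day": "category",
    "time": "category",
    "size": "int8",
    "total_bill": "float32",
    "tip": "float32",
})
df.to_parquet("tips_typed.parquet", index=False)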

Common mistakes & how to fix them

Missing engine dependency

Mistake: Calling df.to_parquet() without pyarrow or fastparquet installed.
Fix: pip install pyarrow

Accidentally persisting the DataFrame index

Mistake: Relying on the default (index=None), which persists the DataFrame index and can add an extra column such as __index_level_0__.
Fix: Always set index=False unless the index is meaningful.

Schema drift across partitions

Mistake: Writing different partitions with mismatched dtypes (e.g. string vs int). Query engines will refuse to read.
Fix: Enforce a single schema up front (Arrow schema or pandas.dtypes) and validate before writes.
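
One lightweight way to enforce a single schema is to cast every batch to the same Arrow schema before writing; the events schema and helper below are hypothetical, and the cast raises if a batch's types have drifted.

import pandas as pd
import pyarrow as pa

# One canonical schema, defined once and reused for every batch/partition
EVENTS_SCHEMA = pa.schema([
    ("user_id", pa.int32()),
    ("event_date", pa.timestamp("ms")),
    ("revenue", pa.float32()),
])

def to_validated_table(batch_df: pd.DataFrame) -> pa.Table:
    """Convert a DataFrame to Arrow, failing loudly if the schema drifted."""
    table = pa.Table.from_pandas(batch_df, preserve_index=False)
    return table.cast(EVENTS_SCHEMA)  # raises on incompatible types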

Working code example

"""Write a 1 million-row DataFrame to partitioned Parquet on local disk"""
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.dataset as ds

# Generate synthetic data
n = 1_000_000
rng = np.random.default_rng(42)
df = pd.DataFrame({
"user_id": rng.integers(1, 10_000, n, dtype=np.int32),
"event_date": pd.to_datetime(rng.integers(1_704_000_000, 1_705_000_000, n), unit="s"),
"revenue": rng.random(n) * 25,
})

# Add a partition column (yyyy-mm-dd)
df["date"] = df["event_date"].dt.date.astype("string")

# Arrow table for efficient writing
schema = pa.schema([
("user_id", pa.int32()),
("event_date", pa.timestamp("ms")),
("revenue", pa.float32()),
("date", pa.string()),
])

table = pa.Table.from_pandas(df, schema=schema, preserve_index=False)

ds.write_dataset(
table,
base_dir="./events_parquet",
format="parquet",
partitioning=["date"],
compression="zstd",
max_rows_per_file=250_000,
)
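
Reading the output back with a partition filter is a quick way to confirm both the hive layout and the pruning behavior; the date literal below just needs to fall inside the synthetic data's range.

# Read back a single day to confirm the layout and partition pruning
events = ds.dataset(
    "./events_parquet",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("date", pa.string())]), flavor="hive"),
)
one_day = events.to_table(filter=ds.field("date") == "2024-01-01").to_pandas()
print(len(one_day), "rows on 2024-01-01")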

How does this relate to Galaxy?

While Galaxy is first and foremost an AI-powered SQL editor, many users query Parquet via external tables (e.g. in DuckDB, Trino, or BigQuery). Storing your pandas data in Parquet ensures those queries run fast and cheaply inside Galaxy’s editor. In the future, Galaxy’s catalog and visualization features will auto-detect Parquet datasets you publish to cloud storage, simplifying discovery and collaboration.

Key takeaways

  • Install pyarrow and call DataFrame.to_parquet() with index=False.
  • Pick an appropriate compression codec and keep row-groups in the 64–512 MB range.
  • For large datasets, use pyarrow.dataset.write_dataset() to create partitioned layouts.
  • Validate schemas to avoid type mismatches across partitions.

Why Write Parquet Files from pandas DataFrames is important

Parquet is the default storage layer for modern lakehouses and query engines. Writing DataFrames directly to Parquet lets data engineers share datasets that load orders of magnitude faster than CSV, reduce storage costs, and enable advanced features like partition pruning and schema evolution.

Write Parquet Files from pandas DataFrames Example Usage


traffic_df.to_parquet("s3://analytics/traffic/2024/05/01.parquet", compression="snappy", index=False)


Frequently Asked Questions (FAQs)

What library do I need to write Parquet in pandas?

Install pyarrow (preferred) or fastparquet. pandas will automatically detect the engine, but explicitly passing engine="pyarrow" guarantees full feature support.

How do I stop pandas from adding the index column?

Set index=False in DataFrame.to_parquet(), or pass preserve_index=False to pyarrow.Table.from_pandas() when building Arrow tables yourself. Otherwise a non-default index shows up as an extra column such as __index_level_0__.

Can Galaxy open or query Parquet files?

Galaxy’s SQL editor can connect to engines like DuckDB or BigQuery that read Parquet natively. Store your pandas output in Parquet, register it as an external table, and query it interactively inside Galaxy.
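
As a quick illustration of that workflow, DuckDB's Python client can query the tips.parquet file written earlier directly from disk; this assumes the duckdb package is installed.

import duckdb

# Query the Parquet file in place; no load or copy step required
result = duckdb.sql(
    "SELECT day, AVG(tip) AS avg_tip FROM 'tips.parquet' GROUP BY day ORDER BY avg_tip DESC"
).df()
print(result)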

What partitioning strategy should I use?

Partition on low-cardinality, high-value filters such as date, region, or customer_tier. Avoid high-cardinality columns like user_id, which create millions of tiny folders and files. For simple cases you can let pandas do the partitioning itself, as shown below.
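
DataFrame.to_parquet accepts partition_cols and writes a hive-style directory tree on its own; the DataFrame and column names below are illustrative.

# pandas writes a hive-style partitioned directory tree directly
df.to_parquet(
    "events/",                           # a directory, not a single file
    engine="pyarrow",
    partition_cols=["date", "region"],   # low-cardinality columns only
    index=False,
)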
