Writing Parquet Files from a pandas DataFrame

Galaxy Glossary

How do I write a Parquet file from a pandas DataFrame?

Persisting a pandas DataFrame to disk in the efficient, columnar Parquet file format using the DataFrame.to_parquet() method and an underlying engine such as PyArrow or fastparquet.


Description

Writing a Parquet file from a pandas DataFrame

The Parquet format is the de-facto standard for columnar data storage in modern data platforms. Learning how to efficiently write Parquet files from pandas lets you move data out of Python notebooks and into the broader analytics ecosystem with minimal friction.

What Is Parquet?

Parquet is an open-source, column-oriented file format designed for efficient data compression and encoding. It stores data by column rather than by row, enabling high-performance analytical workloads where only a subset of columns is required. Major processing engines—including Apache Spark, Trino, Hive, DuckDB, Snowflake, and BigQuery—natively read and write Parquet, making it an excellent interchange format for data engineers.

Why Write Parquet from pandas?

  • Performance: Columnar compression drastically reduces disk footprint and I/O compared to CSV.
  • Schema Preservation: Parquet stores column names, data types, and metadata.
  • Interoperability: Files can be consumed by virtually every modern analytics engine.
  • Partitioning: Directory-based partitioning strategies allow predicate push-down and selective reads.

Core API: DataFrame.to_parquet()

The to_parquet() method is available on any pandas DataFrame. Under the hood, pandas delegates work to an engine—either pyarrow (recommended) or fastparquet.

Minimal Example

import pandas as pd
import pyarrow  # optional: fails fast if the PyArrow engine is not installed

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "event_date": pd.to_datetime(["2024-06-10", "2024-06-11", "2024-06-12"]),
    "purchase_amt": [23.99, 79.50, 12.00],
})

df.to_parquet("events.parquet", engine="pyarrow")

Engine Selection

pandas' default engine="auto" setting prefers PyArrow, which supports the full Parquet feature set (nested types, dictionary encoding, statistics). fastparquet can be faster for narrow, numeric data but lags in feature parity. If you have both installed, specify the engine explicitly:

df.to_parquet("events.parquet", engine="fastparquet")

Compression Settings

Parquet supports multiple codecs. With the PyArrow engine you can use compression="snappy" (the default), "gzip", "brotli", "zstd", or None:

df.to_parquet("events.parquet", compression="brotli", compression_level=7)

Row Groups & Row-Group Size

Row groups partition data internally within a single Parquet file. A good rule of thumb is one row group per 64–128 MiB of uncompressed data. to_parquet() has no dedicated parameter for this, but extra keyword arguments are forwarded to the engine (row_group_size for PyArrow), and for finer control you can build a pyarrow.Table and write it with pyarrow.parquet.write_table.
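A minimal sketch of the PyArrow route, assuming the events.parquet file from the minimal example above exists (the row_group_size value is illustrative):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_parquet("events.parquet")  # file written in the minimal example
table = pa.Table.from_pandas(df)

# row_group_size caps the number of rows written into each row group.
pq.write_table(table, "events_rowgroups.parquet", row_group_size=500_000)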

Partitioned Datasets

When working with large DataFrames, write a directory of Parquet files partitioned by a column:

# Writing to an s3:// path requires the s3fs package; pass credentials via storage_options if needed.
df.to_parquet(
    "s3://analytics/events/",
    partition_cols=["event_date"],
    engine="pyarrow",
)

The call creates a Hive-style layout such as event_date=2024-06-10/part-0.parquet. Spark, DuckDB, and other engines automatically discover partitions and push predicates (WHERE event_date = '2024-06-10') down to the file system, reading only necessary files.
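pandas can exploit that layout on the read side as well: read_parquet forwards filters to PyArrow so that only matching partition directories are scanned. A sketch, assuming the dataset written above and configured s3fs credentials (the exact filter value type depends on how the partition column was written):

import pandas as pd

# Only the event_date=2024-06-10 partition is read; the rest of the dataset is skipped.
one_day = pd.read_parquet(
    "s3://analytics/events/",
    engine="pyarrow",
    filters=[("event_date", "=", "2024-06-10")],
)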

Schema & Data Types

pandas infers the Parquet schema from the DataFrame's dtypes. int64, float64, string, bool, category, datetime64[ns], and their nullable variants all map cleanly to Parquet types. Beware of columns holding mixed Python objects (dtype object); cast them with .astype("string"), .astype("category"), or an appropriate numeric dtype before writing.
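A short sketch of that cleanup, using hypothetical column names:

import pandas as pd

df = pd.DataFrame({"city": ["Austin", "Oslo", None], "plan": ["free", "pro", "pro"]})

# object columns are written as opaque binary unless given an explicit dtype first.
df["city"] = df["city"].astype("string")    # pandas nullable string dtype
df["plan"] = df["plan"].astype("category")  # dictionary-encoded in Parquet

df.to_parquet("customers.parquet", engine="pyarrow")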

End-to-End Example: From Raw CSV to Optimized Parquet

import pandas as pd

# 1 Load raw data
raw_df = pd.read_csv("raw_events.csv", parse_dates=["event_time"])

# 2 Cast columns to optimized dtypes and derive the partition column
raw_df["event_type"] = raw_df["event_type"].astype("category")
raw_df["user_id"] = pd.to_numeric(raw_df["user_id"], downcast="unsigned")
raw_df["event_date"] = raw_df["event_time"].dt.date.astype("string")

# 3 Write partitioned Parquet with brotli compression
raw_df.to_parquet(
    "events_parquet/",
    partition_cols=["event_date"],
    compression="brotli",
    compression_level=6,
    engine="pyarrow",
)

# 4 Validate round-trip
reloaded = pd.read_parquet("events_parquet/", engine="pyarrow")
assert len(reloaded) == len(raw_df)

Best Practices

  • Favor PyArrow: Enables full Parquet feature set and consistent behavior.
  • Use Nullable Dtypes: Int64, Float64, and boolean preserve NaN/None without falling back to object (see the sketch after this list).
  • Compress Intelligently: snappy is fast; brotli or zstd provide higher ratios for cold data.
  • Choose Partition Columns Carefully: Avoid partitioning on high-cardinality columns; aim for fewer than ~10,000 partitions to limit small-file overhead.
  • Validate After Write: Read a sample and compare row counts and dtypes.
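A minimal sketch of the nullable-dtype conversion mentioned above, with hypothetical columns; convert_dtypes() is a convenient bulk upgrade, and .astype("Int64") and friends work per column:

import pandas as pd

df = pd.DataFrame({"user_id": [1, 2, None], "active": [True, None, False]})

# convert_dtypes() upgrades columns to nullable Int64 / boolean / string dtypes,
# so missing values survive the round trip without falling back to object.
df = df.convert_dtypes()
print(df.dtypes)  # user_id: Int64, active: boolean

df.to_parquet("users.parquet", engine="pyarrow")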

Common Pitfalls

  1. Writing object Columns Directly: Leads to inefficient binary blobs. Cast to string, category, or proper numeric types first.
  2. Over-Partitioning: Too many tiny Parquet files cause metadata bloat. Group rows into 64+ MiB files.
  3. Mismatched Schema on Append: Appending files with different dtypes causes read errors. Align dtypes or use pyarrow.dataset.write_dataset with schema enforcement (see the sketch after this list).
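A sketch of the remedy for pitfall 3; the batch data, schema, and output path below are illustrative:

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical new batch to append; real batches would come from an upstream job.
new_batch = pd.DataFrame({
    "user_id": [4, 5],
    "event_date": ["2024-06-13", "2024-06-13"],
    "purchase_amt": [5.25, 17.80],
})

# Declare the schema once and reuse it for every appended batch.
schema = pa.schema([
    ("user_id", pa.int64()),
    ("event_date", pa.string()),
    ("purchase_amt", pa.float64()),
])

# from_pandas casts where it can and raises if a column cannot match the schema.
table = pa.Table.from_pandas(new_batch, schema=schema, preserve_index=False)

ds.write_dataset(
    table,
    "events_dataset/",
    format="parquet",
    partitioning=ds.partitioning(pa.schema([("event_date", pa.string())]), flavor="hive"),
    existing_data_behavior="overwrite_or_ignore",
)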

Beyond pandas: Interoperability

Apache Spark & DuckDB

Once data lands in Parquet, Spark’s spark.read.parquet() or DuckDB’s READ_PARQUET can query it instantly, enabling large-scale analytics without re-ingestion.
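As a quick illustration, the duckdb Python package (assuming it is installed) can aggregate the events.parquet file from the earlier example with no loading step:

import duckdb

# Only the referenced columns are scanned, thanks to Parquet's columnar layout.
revenue = duckdb.sql("""
    SELECT event_date, SUM(purchase_amt) AS revenue
    FROM read_parquet('events.parquet')
    GROUP BY event_date
    ORDER BY event_date
""").df()

print(revenue)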

Loading into a SQL Warehouse for Galaxy

While Galaxy itself is a SQL editor (not a storage engine), teams often bulk-load Parquet files into analytics warehouses like Snowflake or BigQuery. After loading, you can connect Galaxy to the warehouse and query the data with AI-assisted SQL, share results, and collaborate through Collections.

Conclusion

Persisting pandas DataFrames to Parquet is a low-effort, high-impact technique that unlocks interoperability, performance, and cost savings across your analytics stack. By following the best practices above, you’ll avoid common pitfalls and set your data up for success in any downstream tool—from Spark clusters to Galaxy’s modern SQL editor.

Why Writing Parquet Files from a pandas DataFrame is important

Parquet is the universal columnar storage format for modern analytics. Knowing how to export pandas DataFrames to Parquet lets data engineers move data efficiently from Python into scalable processing engines like Spark, BigQuery, and Snowflake—all of which can be queried through tools such as Galaxy. Efficient Parquet writing reduces storage costs, accelerates query performance via column pruning and compression, and enables schema enforcement across heterogeneous systems.

Writing Parquet Files from a pandas DataFrame Example Usage


df.to_parquet("my_dataset.parquet", engine="pyarrow", compression="snappy")


Frequently Asked Questions (FAQs)

What library do I need to write Parquet from pandas?

Install either pyarrow (recommended) or fastparquet. After installation, DataFrame.to_parquet() will detect the engine automatically.

Which compression codec should I use?

snappy is the default and offers balanced speed/size. Use brotli or zstd for better compression ratios on cold data where write speed is less critical.

How do I append new data to an existing Parquet dataset?

Write new Parquet files to the same directory with matching schema. Use a partitioning scheme (e.g., by date) so new files sit in their own sub-directories, avoiding costly rewrites of existing data.
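A sketch of that append pattern on a local, hypothetical dataset; giving each batch a unique file name inside its partition directory keeps earlier files from being overwritten:

import os
import uuid
import pandas as pd

# Hypothetical new rows sharing the existing dataset's schema.
new_batch = pd.DataFrame({"user_id": [4, 5], "purchase_amt": [5.25, 17.80]})

partition_dir = "my_events/event_date=2024-06-13"
os.makedirs(partition_dir, exist_ok=True)

new_batch.to_parquet(f"{partition_dir}/part-{uuid.uuid4().hex}.parquet", engine="pyarrow")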

Can I query Parquet files in Galaxy?

Galaxy is a SQL editor, not a compute engine. Load Parquet files into a warehouse like Snowflake, BigQuery, or a DuckDB server, connect Galaxy to that database, and then run SQL queries with Galaxy’s AI copilot and collaboration features.
