PyDeequ is a Python library that lets you profile, validate, and monitor data quality at scale by leveraging Apache Spark.
Data profiling is the first, most critical step in any data quality initiative. It reveals the shape, completeness, and validity of your data before you operationalize it downstream for analytics, machine learning, or reporting.
This article explains how to perform automated, repeatable data profiling with PyDeequ—a Python wrapper around Amazon’s Deequ library—so you can gain trust in your data pipelines without writing thousands of lines of custom checks.
PyDeequ is the Python API for Deequ, an open-source data-quality framework developed at Amazon. Deequ sits on top of Apache Spark, which means you can run data quality checks on terabytes of data in a distributed fashion without pulling the data into memory on a single machine. While Deequ was originally written in Scala, PyDeequ allows Python users to access the same features via a py4j bridge.
Data profiling answers fundamental questions such as: How complete is each column? How many distinct values does a key field contain? Do numeric values fall within expected ranges? Has the shape of the data changed since the last load?
Without automated profiling, these questions become guesswork, causing broken dashboards, failed ML models, and lost stakeholder confidence.
An Analyzer scans a DataFrame to generate metrics—think column completeness, minimum, mean, or standard deviation. PyDeequ ships with dozens of built-in analyzers.
A VerificationSuite bundles one or more data quality checks (boolean assertions) and runs them in a single pass over the data. For example, you can assert that a column’s null ratio must stay below 0.05.
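To make the semantics of that kind of boolean assertion concrete, here is a plain-Python illustration (not the PyDeequ API) of what "null ratio must stay below 0.05" means once the metric has been computed:

```python
# Illustration only: the semantics of a null-ratio constraint,
# expressed in plain Python rather than the PyDeequ API.
def null_ratio(values):
    """Fraction of entries that are None."""
    return sum(v is None for v in values) / len(values)

user_ids = ["u1", "u2", None, "u4", "u5"]

ratio = null_ratio(user_ids)   # 1 null out of 5 -> 0.2
passes = ratio < 0.05          # the boolean assertion the check encodes
```

In PyDeequ, the metric computation runs distributed on Spark and the lambda you pass to a constraint plays the role of the final comparison.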
A Constraint is an individual rule, such as hasSize(lambda x: x == 1000) or isComplete("user_id").
A Metrics Repository persists profiling results (e.g., in Amazon S3 or JDBC) so you can perform time-series monitoring and detect drift.
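The idea behind a metrics repository can be sketched in plain Python (this is a hypothetical illustration, not PyDeequ's repository API): persist one set of metrics per run, then compare the latest value against history to flag drift.

```python
# Hypothetical sketch of time-series metric monitoring: one metric
# snapshot per run date, with a simple baseline comparison for drift.
history = {
    "2023-09-29": {"completeness(user_id)": 0.999},
    "2023-09-30": {"completeness(user_id)": 0.998},
    "2023-10-01": {"completeness(user_id)": 0.91},   # today's run
}

def drifted(history, metric, tolerance=0.05):
    """True if the latest value deviates from the mean of prior runs."""
    values = [run[metric] for run in history.values()]
    baseline = sum(values[:-1]) / len(values[:-1])
    return abs(values[-1] - baseline) > tolerance

print(drifted(history, "completeness(user_id)"))  # completeness dropped sharply
```

PyDeequ's repositories store the snapshots for you (e.g., as JSON on S3); the comparison logic is where you add your own drift rules.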
pip install pydeequ
PyDeequ requires Apache Spark; if you don’t already have it, install a PySpark version that matches the Deequ jar used below (Deequ 2.0.3 is built against Spark 3.3):
pip install pyspark==3.3.2
import os
os.environ["SPARK_VERSION"] = "3.3"  # PyDeequ reads this to match the Deequ jar

from pyspark.sql import SparkSession
from pydeequ.analyzers import *
from pydeequ.verification import *

spark = (SparkSession
    .builder
    .appName("pydeequ-profile")
    .config("spark.jars.packages", "com.amazon.deequ:deequ:2.0.3-spark-3.3")
    .getOrCreate())
Suppose you ingest a CSV of e-commerce orders:
df = (spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("s3://my-bucket/raw/orders/2023-10-01.csv"))
Run a simple column profiler:
from pydeequ.profiles import ColumnProfilerRunner

result = (ColumnProfilerRunner(spark)
    .onData(df)
    .run())

for col, profile in result.profiles.items():
    print(f"{col}: {profile}")
You’ll get metrics like completeness, approxDistinct, min, max, and type.
from pydeequ.checks import Check, CheckLevel

check = (Check(spark, CheckLevel.Error, "Order data checks")
    .isComplete("order_id")
    .isUnique("order_id")
    .isContainedIn("status", ["PLACED", "SHIPPED", "DELIVERED", "RETURNED"])
    .hasMin("quantity", lambda x: x >= 1)
    .hasMax("quantity", lambda x: x <= 20)
    .hasMean("item_price", lambda x: 0 < x < 1000))
verification_result = (VerificationSuite(spark)
    .onData(df)
    .addCheck(check)
    .run())

if verification_result.status == "Success":
    print("Data quality checks passed!")
else:
    print("Data quality checks failed:")
    VerificationResult.checkResultsAsDataFrame(spark, verification_result).show(truncate=False)
The suite runs all constraints in a single pass over the data, producing an overall Success, Warning, or Error status along with a detailed message for each constraint.
A one-off profile is helpful, but over time you want to know whether quality is degrading. Use a metrics repository to persist results:
from pydeequ.repository import FileSystemMetricsRepository, ResultKey

repository = FileSystemMetricsRepository(spark, "/tmp/deequ-metrics.json")
result_key = ResultKey(spark, ResultKey.current_milli_time(), {"dataset_date": "2023-10-01"})
VerificationSuite(spark) \
    .onData(df) \
    .addCheck(check) \
    .useRepository(repository) \
    .saveOrAppendResult(result_key) \
    .run()
You can later retrieve historical metrics and chart them with any BI tool.
Add PyDeequ jobs to your orchestration platform (Airflow, Dagster, etc.) so that every dataset is profiled before it hits production.
Column profiling surfaces suspicious areas; convert those insights into explicit constraints to prevent regressions.
Persisting results unlocks anomaly detection, SLA monitoring, and executive dashboards.
PyDeequ’s approxDistinct metric uses HyperLogLog, and other approximate analyzers (such as approximate quantiles) rely on similar sketching algorithms. Understand their error bounds before acting on the numbers.
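As a rough guide to those error bounds: HyperLogLog's relative standard error is about 1.04 / sqrt(m), where m is the number of registers (m = 2**p for precision p). This back-of-the-envelope helper (illustrative math only, not PyDeequ's internals, and the precision value is an assumption) shows why an approximate distinct count should be read as a band, not an exact figure:

```python
import math

def hll_error_band(estimate, precision_bits=14):
    """Return a (low, high) band around a HyperLogLog estimate,
    using the standard relative error of ~1.04 / sqrt(2**p)."""
    m = 2 ** precision_bits
    rel_err = 1.04 / math.sqrt(m)   # ~0.81% at p=14
    return estimate * (1 - rel_err), estimate * (1 + rel_err)

low, high = hll_error_band(1_000_000)  # roughly 991,875 to 1,008,125
```

If a constraint on distinct counts matters to the business, size the tolerance in your assertion to at least this band.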
If you run PyDeequ inside a multithreaded environment (e.g., Flask API), create a new SparkSession per request or enforce a global singleton to avoid race conditions.
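The global-singleton option can be sketched as follows; `factory` stands in for whatever builds your session (e.g., a function wrapping `SparkSession.builder...getOrCreate()`), so the sketch stays framework-agnostic:

```python
import threading

# Minimal thread-safe singleton: `factory` runs at most once per
# process, even when concurrent requests call get_session together.
_lock = threading.Lock()
_session = None

def get_session(factory):
    global _session
    if _session is None:                # fast path without the lock
        with _lock:
            if _session is None:        # double-checked inside the lock
                _session = factory()
    return _session
```

Every request then shares one session instead of racing to create its own.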
Sampling can speed up early exploration, but always rerun constraints on the full dataset before promotion to production.
While Galaxy is primarily a modern SQL editor, many teams first explore data via SQL before codifying quality rules in code. Galaxy’s AI copilot can auto-generate SQL queries that mirror PyDeequ constraints—such as counting nulls or checking ranges—so analysts can validate logic quickly before handing it off to the engineering team to implement in PyDeequ. This collaboration loop tightens feedback cycles and keeps everyone aligned on what "good data" means.
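For instance, the isComplete and isContainedIn constraints from the Check above translate directly into SQL you could run from any SQL editor. The sketch below uses an in-memory SQLite table with illustrative names; in practice you would run the same queries against your warehouse:

```python
import sqlite3

# Toy table standing in for the orders dataset (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("o1", "PLACED"), ("o2", "SHIPPED"), ("o3", "TELEPORTED")])

# Mirrors isComplete("order_id"): the NULL count must be zero.
null_count = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE order_id IS NULL").fetchone()[0]

# Mirrors isContainedIn("status", [...]): rows outside the allowed set.
bad_status = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE status NOT IN "
    "('PLACED', 'SHIPPED', 'DELIVERED', 'RETURNED')").fetchone()[0]

print(null_count, bad_status)  # prints: 0 1
```

Once the SQL prototype confirms the rule, engineers can encode the same logic as a PyDeequ constraint.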
By integrating PyDeequ into your data pipeline, you move from ad-hoc, manual data validation to systematic, automated data quality management—unlocking reliable analytics and happier stakeholders.
Reliable analytics and machine-learning models depend on trustworthy data. Manual spot-checks don’t scale, and traditional SQL constraints catch only syntactic errors. PyDeequ lets data engineers embed sophisticated, repeatable quality checks directly into Spark pipelines, preventing bad data from ever reaching production systems. It automates metrics collection, surfaces anomalies, and stores historical trends so you can detect drift before it hurts the business.
Is PyDeequ open source? Yes. PyDeequ and its Scala counterpart Deequ are licensed under Apache 2.0 and available on GitHub.
Great Expectations focuses on Python/pandas workflows, while PyDeequ leverages Spark for distributed processing. Choose PyDeequ when your data volume exceeds single-machine capacity.
Can PyDeequ run without Spark? No, PyDeequ requires an active SparkSession because it relies on Spark’s execution engine to process data and compute metrics.
Galaxy’s AI copilot can generate exploratory SQL queries that mimic PyDeequ’s constraints, making it easier for analysts to prototype checks before engineers encode them in PyDeequ.