Use Airflow DAGs and Postgres operators to automate ParadeDB extension installation, index refreshes, and vector search jobs.
Manual index refreshes and embedding loads are error-prone. Airflow schedules ParadeDB maintenance, guarantees retries, and provides a visual audit trail, keeping search features consistent across releases.
Install the `paradedb` extension in PostgreSQL and `apache-airflow[postgres]` in your environment, and add the `psycopg2-binary` driver if it is not already included. Verify that versions match your Postgres major release.
In the Airflow UI, create a Postgres connection named `paradedb_pg`. Provide the host, port, database, user, and password. DAGs will reference this connection ID to run SQL against ParadeDB.
Add a `PostgresOperator` task that runs `CREATE EXTENSION IF NOT EXISTS paradedb;`. Run it once per database, or schedule it to cover new schemas.
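A minimal one-off DAG for this step might look like the sketch below; the `dag_id` and `task_id` are illustrative assumptions, and it relies on the `paradedb_pg` connection created above.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="paradedb_setup",          # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval=None,           # trigger manually, once per database
    catchup=False,
) as dag:
    install_extension = PostgresOperator(
        task_id="install_paradedb_extension",
        postgres_conn_id="paradedb_pg",
        # IF NOT EXISTS makes re-runs safe against an already-set-up database
        sql="CREATE EXTENSION IF NOT EXISTS paradedb;",
    )
```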
Can index refreshes run on a schedule? Yes. Use a second `PostgresOperator` calling `SELECT paradedb.refresh_index('products_vector_idx');` and set `schedule_interval='0 3 * * *'` to run it nightly at 03:00.
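Put together, a nightly refresh DAG could look like this sketch; the DAG and task names are assumptions, while the SQL and cron schedule come from above.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="paradedb_nightly_refresh",   # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 3 * * *",       # 03:00 every night
    catchup=False,
) as dag:
    refresh_index = PostgresOperator(
        task_id="refresh_products_vector_idx",
        postgres_conn_id="paradedb_pg",
        sql="SELECT paradedb.refresh_index('products_vector_idx');",
        retries=2,  # let Airflow retry transient connection failures
    )
```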
Store vector data in a staging table, then run `INSERT INTO products (id, name, price, stock, embedding) SELECT ... FROM staging_embeddings;` in a task with `autocommit=True` to avoid long transactions.
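A sketch of that load task follows; the `SELECT` column list fills in the elision above as an assumption, and the DAG name and schedule are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="paradedb_embedding_load",    # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    load_embeddings = PostgresOperator(
        task_id="load_staged_embeddings",
        postgres_conn_id="paradedb_pg",
        autocommit=True,  # commit immediately rather than holding a long transaction
        sql="""
            INSERT INTO products (id, name, price, stock, embedding)
            SELECT id, name, price, stock, embedding
            FROM staging_embeddings;
        """,
    )
```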
Keep each SQL statement in its own task for idempotency. Parameterize table names with Jinja templates. Enable `on_failure_callback` to alert when index refreshes fail.
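The sketch below combines the templating and alerting suggestions; the `params` table name, callback body, and DAG name are all assumptions.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator


def alert_on_failure(context):
    # Hypothetical callback: replace the print with Slack, email, or PagerDuty.
    ti = context["task_instance"]
    print(f"ParadeDB task {ti.task_id} failed for run {context['ds']}")


with DAG(
    dag_id="paradedb_maintenance",       # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 3 * * *",
    catchup=False,
    params={"table": "products"},        # overridable when triggering the DAG
) as dag:
    refresh = PostgresOperator(
        task_id="refresh_index",
        postgres_conn_id="paradedb_pg",
        # sql is a templated field, so the table name resolves via Jinja
        sql="SELECT paradedb.refresh_index('{{ params.table }}_vector_idx');",
        on_failure_callback=alert_on_failure,
    )
```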
Do ParadeDB tasks need special workers? No. Standard workers with the Postgres provider can run ParadeDB SQL. Heavy embedding generation should run in a `KubernetesPodOperator` or a Python task.
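For the Python-task route, one possible shape is a TaskFlow function that stages vectors via `PostgresHook`; the query, table, and vector format here are assumptions to adapt, and the placeholder literal stands in for a real model call.

```python
from airflow.decorators import task
from airflow.providers.postgres.hooks.postgres import PostgresHook


@task
def generate_embeddings():
    """Compute vectors off the database workers and stage them for the SQL load."""
    hook = PostgresHook(postgres_conn_id="paradedb_pg")
    rows = hook.get_records("SELECT id, name FROM products WHERE embedding IS NULL")
    # Placeholder vector literal; swap in a real embedding call and match the
    # format your embedding column expects.
    fake_vector = "[" + ", ".join(["0.0"] * 768) + "]"
    hook.insert_rows(
        table="staging_embeddings",
        rows=[(row_id, fake_vector) for row_id, _name in rows],
        target_fields=["id", "embedding"],
    )
```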
Can SQL live in files instead of inline strings? Yes. Store statements in a `sql/` directory and load them with `PostgresOperator(sql="sql/refresh_products_idx.sql")` for cleaner reviews.
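For example (illustrative DAG name; this assumes the `sql/` directory sits next to the DAG file):

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="paradedb_sql_files",         # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 3 * * *",
    catchup=False,
) as dag:
    refresh = PostgresOperator(
        task_id="refresh_products_idx",
        postgres_conn_id="paradedb_pg",
        # Paths ending in .sql are rendered through Jinja's template loader,
        # resolved relative to the DAG file (or any template_searchpath entries).
        sql="sql/refresh_products_idx.sql",
    )
```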