How to Integrate ParadeDB with dbt in PostgreSQL

Galaxy Glossary

How do I integrate ParadeDB with dbt for vector search?

Run vector-search SQL in dbt by enabling the ParadeDB extension, generating embeddings, and querying with cosine distance functions.



Why combine ParadeDB and dbt?

ParadeDB adds vector search to PostgreSQL, while dbt supplies tested, version-controlled transformations. Together they let you build semantic product discovery without leaving SQL.

How do I install ParadeDB in a dbt project?

Create a macro that executes CREATE EXTENSION IF NOT EXISTS paradedb and invoke it with dbt run-operation enable_paradedb. Run it once per environment so subsequent dbt runs can reference ParadeDB functions.

Sample macro

{% macro enable_paradedb() %}
  {% if execute %}
    {# run_query executes the statement; "do" discards its return value #}
    {% do run_query("CREATE EXTENSION IF NOT EXISTS paradedb;") %}
  {% endif %}
{% endmacro %}

How do I store embeddings in a model?

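Add a column of type vector. In an incremental model, call pgml.embed() when new rows arrive. This function comes from the PostgresML (pgml) extension, takes the model name as its first argument, and returns a real[] array, so cast the result to vector.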

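select id,
       name,
       price,
       -- model name is an example; all-mpnet-base-v2 returns 768 values
       pgml.embed('sentence-transformers/all-mpnet-base-v2', name)::vector(768) as embedding
from {{ ref('stg_products') }}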

How can I query similar products?

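Use the <=> cosine-distance operator (or parade.cosine_distance(), if your installation provides it) inside a dbt model or ad-hoc query. Order by distance ascending, since a smaller distance means a closer match, and limit the results.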

select p2.id,
       p2.name,
       p1.embedding <=> p2.embedding as dist  -- cosine distance operator
from products p1
join products p2 on p1.id <> p2.id            -- skip the product itself
where p1.id = 42
order by dist
limit 10;
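
The hard-coded id above suits ad-hoc lookups. Inside a dbt model, one possible approach is a lateral join that stores the nearest neighbors of every product. This is a minimal sketch, assuming a dbt model named products with id and embedding columns and the <=> operator; the model name product_neighbors and the limit of 5 are illustrative choices, not part of the original article.

```sql
-- models/product_neighbors.sql (hypothetical model name)
{{ config(materialized='table') }}

select p.id as product_id,
       n.id as neighbor_id,
       n.dist
from {{ ref('products') }} p
cross join lateral (
    -- for each product p, pick its 5 closest other products
    select c.id,
           p.embedding <=> c.embedding as dist
    from {{ ref('products') }} c
    where c.id <> p.id
    order by p.embedding <=> c.embedding
    limit 5
) n
```

Precomputing neighbors trades storage for read speed: downstream models can join on product_id instead of running a distance query per request.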

What are best practices?

• Materialize embedding models as incremental to avoid re-embedding unchanged rows.
• Create an ANN index, for example CREATE INDEX ON products USING ivfflat (embedding vector_cosine_ops), so similarity queries do not scan every row.
• Store embedding dimension in YAML docs to catch schema drift.
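
To back the documented dimension with a check, a dbt singular test can compare each stored vector's size against the expected value. This is a sketch under assumptions: the model is named products, the embedding column uses the pgvector vector type (so vector_dims() is available), and 768 is a placeholder for your model's output size.

```sql
-- tests/assert_embedding_dimension.sql (hypothetical file name)
-- A dbt singular test fails when this query returns any rows.
{% set expected_dims = 768 %}  {# placeholder: set to your model's output size #}

select id,
       vector_dims(embedding) as actual_dims
from {{ ref('products') }}
where embedding is not null
  and vector_dims(embedding) <> {{ expected_dims }}
```

If the column is declared with a fixed size such as vector(768), Postgres already rejects mismatched vectors, so this test is mainly useful for unsized vector columns.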

Which ParadeDB functions are available?

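The examples in this article use pgml.embed() and pgml.embed_batch(text[]) to generate embeddings, parade.cosine_distance(vector, vector) and the <=> operator to measure cosine distance, and the ivfflat and hnsw ANN index types. The <=> operator and both index types come from pgvector, and the pgml.* functions come from PostgresML, so confirm that these extensions and functions are available in your installation.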

How do I run tests on embeddings?

Create dbt tests that assert embedding is not null and that distances stay below a threshold for canonical pairs.
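
One way to express both assertions is a dbt singular test, which fails when its query returns any rows. The sketch below assumes a model named products; the product ids 42 and 43 and the 0.3 threshold are placeholders for a pair you know should be similar.

```sql
-- tests/assert_embeddings_valid.sql (hypothetical file name)

-- Rows whose embedding is missing
select id, 'missing embedding' as failure_reason
from {{ ref('products') }}
where embedding is null

union all

-- A canonical pair that should be close but is not
select a.id, 'canonical pair too far apart' as failure_reason
from {{ ref('products') }} a
join {{ ref('products') }} b on b.id = 43
where a.id = 42
  and (a.embedding <=> b.embedding) > 0.3
```

The failure_reason column makes it easier to tell which assertion tripped when the failing rows are inspected.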

Can I refresh embeddings after a model change?

Yes. Run dbt run --full-refresh --select <model_name>, or drop the table and run the model again; either way dbt rebuilds every row with the new embedding logic.

Example Usage


-- Find products similar to one item from order 1001
WITH target AS (
    SELECT p.id AS product_id,
           p.embedding AS query_vec      -- reuse the stored embedding
    FROM orders o
    JOIN orderitems oi ON o.id = oi.order_id
    JOIN products p      ON p.id = oi.product_id
    WHERE o.id = 1001
    ORDER BY oi.product_id               -- make the chosen item deterministic
    LIMIT 1
)
SELECT p.id,
       p.name,
       p.embedding <=> t.query_vec AS distance  -- cosine distance
FROM products p
CROSS JOIN target t
WHERE p.id <> t.product_id               -- exclude the target itself
ORDER BY distance
LIMIT 10;

Syntax


-- Enable extension
CREATE EXTENSION IF NOT EXISTS paradedb [WITH SCHEMA schema];

-- Embed single text value (the model name is the first argument)
SELECT pgml.embed('sentence-transformers/all-mpnet-base-v2', 'Wireless Mouse');  -- returns real[]

-- Embed batch (e.g., new product names)
SELECT pgml.embed_batch(ARRAY['Keyboard','Headset']);

-- Vector similarity operators
SELECT *
FROM products
ORDER BY embedding <=> pgml.embed('sentence-transformers/all-mpnet-base-v2', 'gaming mouse')::vector  -- cosine distance
LIMIT 5;

-- Build ANN index for speed
CREATE INDEX products_embedding_idx
    ON products USING ivfflat (embedding vector_cosine_ops) WITH (lists=100);

-- dbt model config example
{{
  config(
    materialized='incremental',
    on_schema_change='append_new_columns'
  )
}}
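SELECT id,
       name,
       price,
       pgml.embed('sentence-transformers/all-mpnet-base-v2', name)::vector(768) AS embedding
FROM {{ ref('stg_products') }}
{% if is_incremental() %}
  -- only embed rows that are not in the target table yet
  WHERE id NOT IN (SELECT id FROM {{ this }})
{% endif %}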

Frequently Asked Questions (FAQs)

Do I need superuser rights?

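Usually, yes. CREATE EXTENSION requires superuser rights unless the extension is marked as trusted, in which case a role with the CREATE privilege on the database is enough. This is a one-time step; ordinary dbt runs can execute search queries afterward.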

Which embedding model does pgml.embed() use?

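pgml.embed() takes the model name as its first argument, for example pgml.embed('sentence-transformers/all-mpnet-base-v2', name). To switch models, pass a different name and rebuild the stored embeddings, because vectors produced by different models are not comparable.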

Can I use ParadeDB on cloud-hosted Postgres?

Many managed providers only allow extensions from a fixed list, and paradedb may not be on it. Check your provider's supported extensions, or use a self-hosted Postgres.
