Run vector-search SQL in dbt by enabling the ParadeDB extension, generating embeddings, and querying with cosine distance functions.
ParadeDB adds vector search to PostgreSQL, while dbt supplies tested, version-controlled transformations. Together you build semantic product discovery without leaving SQL.
Create a run-operation
macro that executes CREATE EXTENSION IF NOT EXISTS paradedb
. Run it once per environment so subsequent dbt runs can reference ParadeDB functions.
{% macro enable_paradedb() %}
{% if execute %}
{{ run_query("CREATE EXTENSION IF NOT EXISTS paradedb;") }}
{% endif %}
{% endmacro %}
Add a column of type vector
. In an incremental model call pgml.embed()
(ParadeDB’s text-to-vector UDF) when new rows arrive.
select id,
name,
price,
pgml.embed(name) as embedding -- 384-D float8[]
from {{ ref('stg_products') }}
Use vector <=> vector
or parade.cosine_distance()
inside a dbt model or ad-hoc query. Order by distance and limit results.
select p2.id,
p2.name,
parade.cosine_distance(p1.embedding, p2.embedding) as dist
from products p1
join products p2 on p1.id <> p2.id
where p1.id = 42
order by dist
limit 10;
• Materialize embedding models as incremental
to avoid re-embedding unchanged rows.
• Use cluster
or CREATE INDEX ON products USING ivfflat(embedding)
for millisecond retrieval.
• Store embedding dimension in YAML docs to catch schema drift.
pgml.embed(text)
, pgml.embed_batch(text[])
, parade.cosine_distance(vector,vector)
, vector <=>
operator, and ANN index types (ivfflat
, hnsw
).
Create dbt tests that assert embedding is not null
and that distances stay below a threshold for canonical pairs.
Yes. Bump the model’s sql_header
version or drop the table; dbt will rebuild all rows with the new embedding logic.
Yes, installing paradedb
once requires a superuser or a role with CREATE
on the database. Ordinary dbt runs can execute search queries afterward.
pgml.embed()
use?ParadeDB defaults to sentence-transformers/all-mpnet-base-v2. Override with SET pgml.model = 'model_name'
before calling the function.
Most managed providers block custom C extensions. Use a self-hosted Postgres or a provider that allows paradedb
.