ParadeDB data modeling optimizes PostgreSQL schemas to store relational and vector data efficiently while keeping queries fast.
ParadeDB extends PostgreSQL with fast vector search, so your schema must support both traditional relations and high-dimensional embedding data. Proper types, constraints, and indexes keep similarity queries and joins snappy.
Start with clear PRIMARY KEYs, FOREIGN KEYs, and NOT NULL columns. Use numeric surrogate keys for joins; avoid composite keys that bloat indexes.
CREATE TABLE Customers (id BIGSERIAL PRIMARY KEY, name TEXT NOT NULL, email TEXT UNIQUE NOT NULL, created_at TIMESTAMPTZ DEFAULT NOW());
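A related table follows the same pattern. The Orders definition below is a sketch (its columns are assumptions) showing a BIGINT surrogate key plus a FOREIGN KEY back to Customers.
CREATE TABLE Orders (id BIGSERIAL PRIMARY KEY, customer_id BIGINT NOT NULL REFERENCES Customers(id), total NUMERIC(10,2), created_at TIMESTAMPTZ DEFAULT NOW());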
Store embeddings in VECTOR(dim) columns provided by ParadeDB. Keep them in separate tables when sparsity or multiple embeddings per row are possible.
ALTER TABLE Products ADD COLUMN embedding VECTOR(384);
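When a product can carry several embeddings (per model or per content chunk), a side table keeps the main row narrow. The product_embeddings table below is a hypothetical sketch, assuming Products has a BIGINT id primary key like Customers; the model column is illustrative.
CREATE TABLE product_embeddings (id BIGSERIAL PRIMARY KEY, product_id BIGINT NOT NULL REFERENCES Products(id), model TEXT NOT NULL, embedding VECTOR(384) NOT NULL, UNIQUE (product_id, model));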
Normalize transactional data (Customers, Orders) to avoid anomalies. Denormalize read-heavy product catalogs or pre-computed recommendation tables to speed vector similarity joins.
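As an example of the denormalized side, a pre-computed recommendation table (hypothetical name and columns) can be refreshed offline and joined cheaply at read time.
CREATE TABLE product_recommendations (product_id BIGINT NOT NULL, recommended_id BIGINT NOT NULL, score REAL NOT NULL, PRIMARY KEY (product_id, recommended_id));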
Use CREATE INDEX ... USING ivfflat or hnsw for approximate nearest-neighbor (ANN) search. For ivfflat, set lists at index build time and probes at query time according to dataset size.
CREATE INDEX products_vec_idx ON Products USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
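At query time, pgvector's probes setting controls how many ivfflat lists are scanned; more probes means better recall but slower queries. The sketch below orders by L2 distance with the <-> operator, matching vector_l2_ops above; the vector literal is abbreviated for readability.
SET ivfflat.probes = 10;
SELECT id FROM Products ORDER BY embedding <-> '[0.12, 0.05, ...]'::vector LIMIT 10; -- supply a full 384-dimension literal or a query parameter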
Wrap writes to relational and vector tables in a single transaction to keep embeddings consistent with the source row.
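A minimal sketch, assuming the hypothetical product_embeddings side table from above; the column values and model name are illustrative. If either statement fails, both roll back and the embedding never drifts from its source row.
BEGIN;
UPDATE Products SET name = 'Trail Runner v2' WHERE id = 42;
UPDATE product_embeddings SET embedding = '[0.08, 0.31, ...]'::vector WHERE product_id = 42 AND model = 'all-MiniLM-L6-v2'; -- abbreviated literal; use the recomputed embedding
COMMIT;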
Use ADD COLUMN for new embeddings, backfill in batches, then build indexes. For large tables, build indexes concurrently to prevent downtime.
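A sketch of that sequence for a hypothetical second embedding column; the dimension, batch size, and lists value are placeholders to tune for your data.
ALTER TABLE Products ADD COLUMN embedding_v2 VECTOR(768);
-- backfill embedding_v2 in batches (e.g. 10k rows per UPDATE) from application code
CREATE INDEX CONCURRENTLY products_vec2_idx ON Products USING ivfflat (embedding_v2 vector_l2_ops) WITH (lists = 200);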
Partition Orders by month or Products by category to prune similarity scans. Keep partitions under a few million rows for faster ANN indexing.
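A minimal sketch of monthly partitioning, written as if Orders were declared partitioned from the start; PostgreSQL requires the partition key inside the primary key, so the surrogate key becomes composite here.
CREATE TABLE Orders (id BIGSERIAL, customer_id BIGINT NOT NULL REFERENCES Customers(id), created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(), PRIMARY KEY (id, created_at)) PARTITION BY RANGE (created_at);
CREATE TABLE orders_2024_01 PARTITION OF Orders FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');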
Quick checklist: use BIGINT keys; enforce dimensions with CHECK (vector_dims(embedding) = 384); run ANALYZE after bulk loads; monitor pg_stat_user_indexes for index usage.
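The checklist items translate directly into DDL and maintenance commands; the constraint name below is arbitrary.
ALTER TABLE Products ADD CONSTRAINT products_embedding_dims CHECK (vector_dims(embedding) = 384);
ANALYZE Products;
SELECT relname, indexrelname, idx_scan FROM pg_stat_user_indexes WHERE relname = 'products';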
Storing vectors in JSONB slows distance calculations; use the VECTOR type instead.
Building ivfflat on tiny tables: linear scan is faster under 1k rows; skip ANN indexes until data grows.
Does adding vector search change how I model my data? Yes. You must plan for vector storage and ANN indexes, which affect disk layout and maintenance.
When do ANN indexes pay off? Once you exceed roughly 1,000 rows; below that, a sequential scan is faster.
Can I keep embeddings in a separate database? You can, but remote vectors slow similarity joins. Prefer local storage for high-frequency searches.