validate_data checks that vector columns managed by ParadeDB contain valid, correctly dimensioned, non-NULL embeddings and reports any broken rows.
validate_data scans a chosen vector column, confirms every embedding has the expected dimension, detects NULL values, and returns the IDs of offending rows along with the reason each one failed. This prevents silent search failures caused by malformed vectors.
Run it after bulk imports, before building a new index, or on a schedule in CI to keep your similarity search dependable and fast.
Call the function from the parade schema, passing the table name, vector column, and expected dimension. Optional parameters let you choose a sampling size or abort on the first error.
SELECT *
FROM parade.validate_data(
  table_name     => 'products',
  column_name    => 'embedding',
  dimension      => 384,
  abort_on_error => false
);
The query returns one row for every vector that is NULL, has the wrong dimension, or contains non-finite values (NaN or infinity).
You receive a SETOF RECORD with two columns: id (primary key) and reason (text). An empty set means your data is clean.
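If the function is declared as returning SETOF record (rather than a typed result table), PostgreSQL requires a column definition list at the call site. A sketch, with the column types (id bigint, reason text) assumed:

```sql
-- Column definition list required for a SETOF record function;
-- the types here are assumptions, adjust to your schema.
SELECT id, reason
FROM parade.validate_data(
  table_name  => 'products',
  column_name => 'embedding',
  dimension   => 384
) AS t(id bigint, reason text)
ORDER BY id;
```

If the function instead declares OUT parameters or RETURNS TABLE, the definition list is unnecessary and plain SELECT * works as in the example above.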
Store the output in a temporary table and fix issues with an UPDATE or DELETE. Automate the check with a nightly scheduled job.
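A minimal clean-up flow along those lines; the reason strings ('null_embedding', 'wrong_dimension') are illustrative assumptions, so check the actual values your installation emits:

```sql
-- Capture offending rows once, then repair from the snapshot.
CREATE TEMP TABLE bad_vectors AS
SELECT *
FROM parade.validate_data(
  table_name  => 'products',
  column_name => 'embedding',
  dimension   => 384
);

-- Delete rows whose embedding cannot be recovered.
-- 'null_embedding' is an assumed reason string.
DELETE FROM products
WHERE id IN (SELECT id FROM bad_vectors WHERE reason = 'null_embedding');

-- NULL-out wrong-length vectors so a re-embedding job picks them up.
-- 'wrong_dimension' is an assumed reason string.
UPDATE products
SET embedding = NULL
WHERE id IN (SELECT id FROM bad_vectors WHERE reason = 'wrong_dimension');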
Yes: pass sample_size to validate only N random rows for faster feedback in development. Omit the parameter to scan the full table in production.
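A development-time spot check might look like this (the sample size of 1,000 is an arbitrary choice):

```sql
-- Validate 1,000 random rows instead of the full table.
SELECT *
FROM parade.validate_data(
  table_name  => 'products',
  column_name => 'embedding',
  dimension   => 384,
  sample_size => 1000
);
```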
The check itself is read-only, and you can wrap it in a transaction together with your fix-up statements so the repairs apply atomically.
No: the function is strictly read-only and will never alter rows. It only returns the IDs that need attention.
The function streams results and uses an index when available. Still, for multi-million-row tables, schedule it off-peak or pass sample_size.
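One way to run the check off-peak is a scheduled job. A sketch using the pg_cron extension (assuming it is installed); validation_log is a hypothetical audit table you would create first:

```sql
-- Run the validation nightly at 03:00 via pg_cron and record failures.
-- validation_log(checked_at, id, reason) is a hypothetical table.
SELECT cron.schedule(
  'nightly-vector-check',
  '0 3 * * *',
  $$INSERT INTO validation_log (checked_at, id, reason)
    SELECT now(), v.id, v.reason
    FROM parade.validate_data(
      table_name  => 'products',
      column_name => 'embedding',
      dimension   => 384
    ) AS v(id bigint, reason text)$$
);
```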
Wrap validate_data in your own PL/pgSQL function and append extra WHERE clauses or checks specific to your schema.
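A sketch of such a wrapper; the extra rule (discontinued products should not carry embeddings) and all names here are illustrative, and the column definition list assumes validate_data returns SETOF record:

```sql
-- Hypothetical wrapper: forwards to validate_data, then appends a
-- schema-specific check of our own.
CREATE OR REPLACE FUNCTION validate_products()
RETURNS TABLE (id bigint, reason text)
LANGUAGE plpgsql
AS $$
BEGIN
  -- Standard checks: NULLs, wrong dimension, non-finite values.
  RETURN QUERY
  SELECT v.id, v.reason
  FROM parade.validate_data(
    table_name  => 'products',
    column_name => 'embedding',
    dimension   => 384
  ) AS v(id bigint, reason text);

  -- Extra schema-specific rule, purely illustrative:
  -- discontinued products should not carry embeddings at all.
  RETURN QUERY
  SELECT p.id, 'embedding_on_discontinued_product'::text
  FROM products p
  WHERE p.discontinued AND p.embedding IS NOT NULL;
END;
$$;
```

From there you can call validate_products() anywhere you would have called validate_data, including the scheduled jobs described above.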