normalize() in ParadeDB cleans and standardizes text through configurable pipelines for consistent full-text and vector search.
Inconsistent casing, accents, and stop-words hurt full-text and vector search recall. The normalize() function lets you pre-process text with a single call, ensuring every row follows the same rules before indexing or embedding.
The function accepts a text value plus an optional comma-separated pipeline string. Pipelines can include lowercase, asciifold, strip_accents, remove_stopwords, and custom steps you defined in ParadeDB.
SELECT normalize(input_text [, 'pipeline_step1, pipeline_step2']);
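To make the pipeline idea concrete, here is a minimal Python sketch of how a comma-separated pipeline string could be applied step by step. It is an illustration only, not ParadeDB's implementation: the step functions, the tiny stop-word list, and the exact behavior of each step are assumptions chosen for the example.

```python
import unicodedata

# Hypothetical stand-ins for the named steps; the real steps may behave differently.
STOPWORDS = {"a", "an", "and", "of", "the"}  # tiny illustrative list, not a real stop-word set


def lowercase(text: str) -> str:
    return text.lower()


def strip_accents(text: str) -> str:
    # Decompose characters (NFD), then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def asciifold(text: str) -> str:
    # Strip accents first, then drop anything that is still outside ASCII.
    return strip_accents(text).encode("ascii", "ignore").decode("ascii")


def remove_stopwords(text: str) -> str:
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)


STEPS = {
    "lowercase": lowercase,
    "asciifold": asciifold,
    "strip_accents": strip_accents,
    "remove_stopwords": remove_stopwords,
}


def normalize(text: str, pipeline: str = "") -> str:
    # Apply each comma-separated step in the order given.
    for name in (s.strip() for s in pipeline.split(",")):
        if name:
            text = STEPS[name](text)
    return text


print(normalize("The Café of São Paulo", "lowercase, strip_accents, remove_stopwords"))
# prints: cafe sao paulo
```

Order matters in this sketch: running lowercase before remove_stopwords lets a lowercase stop-word list match "The" as well as "the".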
Product titles benefit from lowercase and asciifold. Customer reviews often add strip_accents and remove_stopwords. Test different combinations, then store the output in a dedicated column for fast retrieval.
To normalize existing rows, use an UPDATE with a WHERE filter to control scope. Wrap the call in a transaction when touching critical tables so you can roll back if the result looks wrong.
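The scoped-update-then-inspect pattern can be tried out safely with Python's built-in sqlite3 module. This sketch uses SQLite as a stand-in database, not ParadeDB: the products table, its columns, and the registered normalize function (which only understands a lowercase step) are all hypothetical and exist only to show the UPDATE, WHERE, and rollback flow.

```python
import sqlite3


def normalize(text, pipeline):
    # Hypothetical stand-in that supports only the 'lowercase' step.
    steps = [s.strip() for s in pipeline.split(",") if s.strip()]
    return text.lower() if "lowercase" in steps else text


conn = sqlite3.connect(":memory:")
conn.create_function("normalize", 2, normalize)  # expose the stand-in to SQL

conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, category TEXT, title TEXT, title_norm TEXT)"
)
conn.executemany(
    "INSERT INTO products (category, title) VALUES (?, ?)",
    [("toys", "Red ROBOT"), ("books", "SQL Basics")],
)
conn.commit()

# sqlite3 opens a transaction implicitly before the UPDATE;
# the WHERE clause limits the change to one category.
conn.execute(
    "UPDATE products SET title_norm = normalize(title, 'lowercase') "
    "WHERE category = ?",
    ("toys",),
)

# Inspect the result before deciding whether to keep it.
preview = conn.execute(
    "SELECT title, title_norm FROM products ORDER BY id"
).fetchall()
print(preview)

# Pretend the result looked wrong: roll back and confirm nothing changed.
conn.rollback()
after_rollback = conn.execute(
    "SELECT title_norm FROM products ORDER BY id"
).fetchall()
print(after_rollback)
```

Calling conn.commit() instead of conn.rollback() would keep the change. The raw title column is never modified either way.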
Keeping both raw and normalized versions preserves the original wording for display while optimizing search and similarity joins on the processed column. The extra storage cost is negligible for most workloads.
1) Index the normalized column with GIN for tsvector or IVFFLAT for vectors.
2) Normalize at write-time, not on every query.
3) Document the exact pipeline in your DDL so teammates can replicate results.
Omitting the pipeline argument returns the input unchanged, so always pass at least one step. Over-normalizing IDs can strip essential characters; apply the function only to free-text columns.
Because normalization is destructive, keep the raw column or use logical replication to recover. Otherwise, you must reload from backups.
When executed at write-time, the overhead is negligible. Avoid calling it in SELECT lists for large result sets.
Custom steps are supported. ParadeDB lets you register custom transforms in SQL or Rust. Reference the new step by name in the pipeline string.
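The register-then-reference idea can be sketched in a few lines of Python. This is a conceptual model only: register_step, the STEPS registry, and the collapse_whitespace step are hypothetical names invented for the example and say nothing about how ParadeDB actually registers transforms.

```python
from typing import Callable, Dict

# Registry mapping step names to functions; starts with one built-in step.
STEPS: Dict[str, Callable[[str], str]] = {"lowercase": str.lower}


def register_step(name: str, fn: Callable[[str], str]) -> None:
    # Make a custom transform available under the given name.
    STEPS[name] = fn


def normalize(text: str, pipeline: str) -> str:
    # Look up each named step and apply it in order; unknown names are an error.
    for name in (s.strip() for s in pipeline.split(",")):
        if not name:
            continue
        if name not in STEPS:
            raise ValueError(f"unknown pipeline step: {name}")
        text = STEPS[name](text)
    return text


# Hypothetical custom step: trim the text and collapse runs of whitespace.
register_step("collapse_whitespace", lambda t: " ".join(t.split()))

print(normalize("  Wireless   MOUSE ", "lowercase, collapse_whitespace"))
# prints: wireless mouse
```

Raising an error for an unknown step name is a design choice in this sketch; it surfaces typos in the pipeline string instead of silently skipping a step.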
Normalization cannot be undone. Keep the raw text in a separate column or backup if you need the original.