Connect Apache Spark to a ParadeDB-enabled PostgreSQL instance, allowing distributed analytics and vector search in one workflow.
Integrating ParadeDB’s vector search with Spark lets you run large-scale feature engineering, similarity search, and analytics without moving data from PostgreSQL. Spark handles distributed compute while ParadeDB stores vectors alongside transactional data.
• ParadeDB installed on PostgreSQL 15+.
• Apache Spark 3.4 with the PostgreSQL JDBC driver.
• Network access between Spark executors and the database.
• The ParadeDB extension enabled (CREATE EXTENSION paradedb;).
Use Spark’s built-in JDBC connector. Specify the table containing your vector column so Spark can treat it as a binary field for downstream processing.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db:5432/shop")
  .option("dbtable", "products")
  .option("user", "sparkuser")
  .option("password", sys.env("PG_PASS"))
  .load()
Wrap your ParadeDB <-> operator in a subquery. The ordering and LIMIT run inside PostgreSQL, so only the closest matches are returned to Spark.
SELECT *
FROM (
  SELECT id, name, price, embedding <-> ARRAY[0.12,0.88,0.55] AS distance
  FROM products
  ORDER BY distance
  LIMIT 50
) AS t;
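As a minimal sketch, the inner query can be handed to Spark as the dbtable option so the database does the nearest-neighbor work before any rows reach the executors. The names knnQuery and nearestDF are illustrative; the connection details follow the earlier read example.

// Pass the k-NN subquery as a derived table; PostgreSQL evaluates the
// <-> distance, ORDER BY, and LIMIT, and Spark only receives the top rows.
val knnQuery =
  """(SELECT id, name, price, embedding <-> ARRAY[0.12,0.88,0.55] AS distance
    |   FROM products
    |  ORDER BY distance
    |  LIMIT 50) AS t""".stripMargin

val nearestDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db:5432/shop")
  .option("dbtable", knnQuery)
  .option("user", "sparkuser")
  .option("password", sys.env("PG_PASS"))
  .load()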
Yes — processed results can be written back to ParadeDB. Use DataFrame.write with mode "append" or "overwrite". Ensure the destination table has a compatible vector column type.
processedDF.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://db:5432/shop")
  .option("dbtable", "recommendations")
  .option("user", "sparkuser")
  .option("password", sys.env("PG_PASS"))
  .mode("append")
  .save()
• Use a small fetchsize (e.g., 1,000) to stream rows instead of buffering whole result sets.
• Partition reads by a numeric primary key to parallelize across executors (see the sketch after this list).
• Create a ParadeDB ivfflat index on the vector column.
• Keep Spark executors close to the database to reduce latency.
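A rough sketch of the first two tips, assuming products has a numeric id primary key; the id bounds and partition count below are illustrative and should be tuned to your data.

// Split the scan into parallel JDBC partitions on the id column and stream
// rows in small batches rather than buffering entire result sets per task.
val productsDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db:5432/shop")
  .option("dbtable", "products")
  .option("user", "sparkuser")
  .option("password", sys.env("PG_PASS"))
  .option("partitionColumn", "id")   // numeric primary key
  .option("lowerBound", "1")         // illustrative id range
  .option("upperBound", "10000000")
  .option("numPartitions", "16")     // one JDBC query per partition
  .option("fetchsize", "1000")       // stream ~1,000 rows per round trip
  .load()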
• Use SSL in the JDBC URL (ssl=true&sslmode=require), as in the sketch after this list.
• Restrict firewall access to Spark’s IP range.
• Store credentials in environment variables or a secrets manager.
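A hedged sketch combining the SSL and credentials points; PG_USER and PG_PASS are hypothetical environment variable names, and certificate verification settings depend on how your PostgreSQL instance is provisioned.

// TLS-enabled JDBC URL with credentials pulled from the environment rather
// than hard-coded in the job.
val secureUrl = "jdbc:postgresql://db:5432/shop?ssl=true&sslmode=require"

val secureDF = spark.read
  .format("jdbc")
  .option("url", secureUrl)
  .option("dbtable", "products")
  .option("user", sys.env("PG_USER"))
  .option("password", sys.env("PG_PASS"))
  .load()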
Package the Scala/PySpark job, schedule it in Airflow, and parameterize the target vector and customer cohort. Keep connection properties in centralized configs.
Ensure the PostgreSQL JDBC driver on Spark’s classpath is recent enough for your server version; outdated drivers (for example, versions without SCRAM authentication support) surface as authentication errors. Upgrade the JAR in Spark’s classpath.
If vectors appear as text, set stringtype=unspecified in the JDBC URL so the driver returns them as bytea, preserving dimensionality.
With a few JDBC options and ParadeDB indexing, Spark can query billions of embeddings directly in PostgreSQL, unifying analytics and ML workflows.
No — Spark has no native vector type; vectors arrive as bytea blobs. Cast them to arrays in SQL or decode them in Spark UDFs for ML processing.
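A minimal sketch of the UDF route, assuming the embedding column reaches Spark in its text form (e.g. "[0.12,0.88,0.55]", obtained by casting it to text such as embedding::text in the pushed-down query); the parseVector name is illustrative, and the parsing would need to change if your driver delivers raw bytea instead.

import org.apache.spark.sql.functions.{col, udf}

// Parse the text form of a vector, e.g. "[0.12,0.88,0.55]", into Array[Float]
// so downstream ML code can consume it as a numeric feature column.
val parseVector = udf { (s: String) =>
  s.stripPrefix("[").stripSuffix("]")
    .split(",")
    .map(_.trim.toFloat)
}

// df is the DataFrame read over JDBC earlier; adds a decoded array column.
val decodedDF = df.withColumn("embedding_array", parseVector(col("embedding")))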
Absolutely — PySpark works the same way. Replace the Scala code with spark.read.format("jdbc").options(...).load() in Python; the JDBC mechanics remain identical.
Start with executors totaling the same vCPU count as your PostgreSQL cores. Scale Spark separately from the database to avoid saturating one layer.