Vector Databases Explained


This guide demystifies vector databases—why they matter, how they work, and when to use them. You’ll learn core concepts like embeddings and approximate-nearest-neighbor indexing, build a hands-on semantic-search demo, and discover best practices plus tips for querying pgvector straight from Galaxy’s SQL editor.


Learning Objectives

  • Define what a vector database is and how it differs from traditional databases.
  • Explain key concepts such as embeddings, similarity metrics, and ANN (Approximate-Nearest-Neighbor) indexing.
  • Identify real-world use cases where vector storage outperforms classic SQL tables.
  • Build a working semantic-search prototype using pgvector in PostgreSQL.
  • Query vector data with SQL and visualize results inside Galaxy.

1. Background: From Relational Rows to High-Dimensional Vectors

1.1 What is a Vector?

A vector is an ordered list of numbers that represents an item’s position in a multi-dimensional space. In machine learning, vectors typically come from an embedding model—for example, turning a sentence like “Galaxy is a modern SQL editor” into a 1 × 1536 float array.

1.2 Why Store Vectors?

Vectors let us compare complex objects by geometric distance instead of exact equality. That unlocks applications such as:

  • Semantic Search – Find documents with similar meaning, even if they share no keywords.
  • Product Recommendation – Suggest items whose embeddings are near each other in taste space.
  • RAG (Retrieval-Augmented Generation) – Ground LLM answers in your private knowledge base.
  • Anomaly Detection – Spot outliers far from the normal cluster.

2. Core Concepts of Vector Databases

2.1 Embeddings

An embedding function (BERT, OpenAI, Cohere, etc.) maps raw data—text, images, audio—to dense vectors. The choice of model affects the quality and dimensionality (e.g., 384 vs 1536 dimensions).
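
For a concrete feel, here is a minimal sketch using the open-source sentence-transformers library; any embedding API works the same way, and the model name below is just one common small model:

# Quick illustration: the embedding model fixes the vector's dimensionality.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # a small open-source model
vec = model.encode("Galaxy is a modern SQL editor")

print(len(vec))   # 384 dimensions here, vs. 1536 for OpenAI's text-embedding-3-small
print(vec[:5])    # the first few floats of the dense vector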

2.2 Similarity Metrics

Common distance functions include:

  • cosine – angle between vectors (scale-invariant).
  • euclidean – straight-line (L2) distance.
  • dot_product – unnormalized similarity (good for transformers).

Your database must support the metric you plan to query with.
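
To make these metrics concrete, here is a minimal NumPy sketch; the two short vectors stand in for real embeddings:

import numpy as np

a = np.array([0.2, 0.1, 0.9])
b = np.array([0.3, 0.0, 0.8])

cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_distance = 1 - cosine_similarity      # what pgvector's <=> operator returns
euclidean_distance = np.linalg.norm(a - b)   # straight-line (L2) distance
dot_product = np.dot(a, b)                   # unnormalized similarity

print(cosine_distance, euclidean_distance, dot_product)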

2.3 Approximate-Nearest-Neighbor (ANN) Indexes

Brute-force scanning every vector is O(N · d). ANN indexes such as HNSW (Hierarchical Navigable Small World) or IVF (Inverted File) trade tiny accuracy loss for 100-1000× speedups—critical when your table has millions of rows.
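
To see why indexes matter, here is the brute-force baseline that an ANN index replaces; the corpus below is random placeholder data:

import numpy as np

N, d = 10_000, 1536                               # placeholder corpus size and dimensionality
corpus = np.random.rand(N, d).astype(np.float32)
query = np.random.rand(d).astype(np.float32)

# O(N * d): compute the distance from the query to every stored vector, then sort.
distances = np.linalg.norm(corpus - query, axis=1)
top_k = np.argsort(distances)[:5]                 # indexes of the 5 closest vectors
print(top_k, distances[top_k])

HNSW navigates a small-world graph and IVF probes only a handful of clusters, so both skip the full scan while returning near-identical results.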

2.4 Data Model & Schemas

A vector database typically stores:

  • id – UUID / BIGINT primary key
  • embedding – VECTOR(1536)
  • metadata – JSONB (title, url, etc.)

Metadata enables post-filtering (e.g., date range, user_id) before similarity search.

3. When to Use a Vector Database

Ask two questions:

  1. Do I need similarity search beyond exact matches?
  2. Will the number of vectors or QPS (queries per second) exceed brute-force feasibility?

If yes to either, a vector index—inside Postgres or a purpose-built system like Pinecone, Weaviate, or Milvus—pays off.

4. Hands-On: Building Semantic Search with Postgres + pgvector

The pgvector extension adds a vector column type and ANN indexes to PostgreSQL, letting you stay inside a familiar relational stack.

Step 1 – Install pgvector

# Homebrew example (macOS)
brew install pgvector
psql -d mydb -c "CREATE EXTENSION IF NOT EXISTS vector;"

Step 2 – Create the Table

CREATE TABLE documents (
  id        SERIAL PRIMARY KEY,
  content   TEXT,
  embedding VECTOR(1536)
);

-- Add an HNSW index for fast cosine search
CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);

Step 3 – Generate & Insert Embeddings (Python)

import json
import psycopg2
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_KEY")
conn = psycopg2.connect(dsn="postgres://@localhost/mydb")
cur = conn.cursor()

texts = ["Galaxy boosts SQL productivity", "Vector DB powers semantic search"]
for t in texts:
    resp = client.embeddings.create(model="text-embedding-3-small", input=t)
    vec = json.dumps(resp.data[0].embedding)  # pgvector accepts the '[...]' text format
    cur.execute("INSERT INTO documents (content, embedding) VALUES (%s, %s)", (t, vec))
conn.commit()

Step 4 – Query by Similarity

-- 1. Obtain the query vector (Python or your LLM of choice)
SELECT content, embedding <=> '[0.123, 0.456, ...]' AS distance
FROM documents
ORDER BY distance LIMIT 5;

<=> is pgvector’s operator for cosine distance. Smaller values = closer matches.
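
In practice you compute the query vector in code and pass it as a parameter rather than pasting 1,536 floats into the editor. A minimal sketch, reusing the client, cur, and json import from Step 3:

# Embed the search phrase, then parameterize the similarity query.
q = client.embeddings.create(model="text-embedding-3-small",
                             input="how do vector databases work?")
query_vec = json.dumps(q.data[0].embedding)   # pgvector accepts the '[...]' text format

cur.execute(
    "SELECT content, embedding <=> %s AS distance "
    "FROM documents ORDER BY distance LIMIT 5",
    (query_vec,),
)
for content, distance in cur.fetchall():
    print(round(distance, 4), content)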

Exercise 1 – Try It Yourself

  1. Load 1,000 news headlines into documents.
  2. Run a similarity search for “latest AI breakthroughs.”
  3. How many results still make sense if you set LIMIT 1? Why?

5. Best Practices & Common Pitfalls

  • Normalize embeddings to unit length before indexing; with unit vectors, cosine, inner-product, and L2 distance all rank results the same way (see the sketch after this list).
  • Batch inserts—avoid one row per network call.
  • Use metadata filters to pre-restrict the candidate set; this keeps recall high and reduces query cost.
  • Re-index periodically if you load millions of new vectors.
  • Security—vectors can leak info; apply row-level permissions like any sensitive data.
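
A tiny sketch of the normalization point above, using NumPy:

import numpy as np

vec = np.array([0.2, 0.1, 0.9], dtype=np.float32)   # placeholder embedding
unit_vec = vec / np.linalg.norm(vec)                 # divide by the L2 norm

print(np.linalg.norm(unit_vec))                      # -> 1.0
# With unit-length vectors, inner product equals cosine similarity,
# so dot-product and L2 rankings agree with cosine rankings.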

6. Integrating Vector Queries with Galaxy

Because pgvector lives inside PostgreSQL, you can connect your database to Galaxy in seconds. Benefits:

  • Autocompletion for vector columns and operators (<=>, <#>).
  • Parameterized snippets—store a query template that accepts a JSON array embedding.
  • AI Copilot—generate SQL that joins similarity scores with metadata, even across schemas.
  • Collections—endorse a “semantic_search_base” query so teammates can safely reuse it without touching dense math.

-- Galaxy Snippet: Top K similar documents
SELECT id, content, embedding <=> :query_vec AS distance
FROM documents
WHERE created_at > NOW() - INTERVAL '30 days'
ORDER BY distance
LIMIT :k;

You can share the snippet via a Galaxy Collection—no more copying raw SQL into Slack.

7. Key Takeaways & Next Steps

  • Vector databases excel at similarity search for high-dimensional data.
  • You don’t always need a separate product—pgvector brings vectors to Postgres.
  • ANN indexes (HNSW, IVF) are the secret sauce for millisecond-level queries.
  • Galaxy’s modern SQL editor is a perfect front-end to experiment with vector queries.
  • Explore dedicated engines (Pinecone, Milvus, Weaviate) if you outgrow single-node Postgres.

Next Steps: Add embeddings for your documentation repo, create a semantic “search the docs” route, and watch user support tickets drop.
