What is a dbt seed and how do I use it?

A dbt seed is a static dataset, typically a CSV file, that dbt manages and loads into your data warehouse as a table when you run dbt seed.


What Is a dbt Seed?

A dbt seed is a collection of static data files, most commonly CSVs, placed in a project's seeds/ directory (data/ in dbt versions before 1.0) and ingested into your target data warehouse as tables when you run dbt seed (or dbt build, which includes seeds). These tables can then be referenced in downstream models just like any other table in the warehouse.

Why Are dbt Seeds Important?

Seeding solves problems that arise when you need look-up tables, small reference datasets, or fixture data that rarely changes but must live inside the warehouse so your transforms remain SQL-only and source-controlled. Without seeds you would have to manage external scripts or ETL jobs for these tiny datasets—adding fragility and cognitive load. By version-controlling seeds alongside models, your entire transformation graph stays transparent, reproducible, and CI/CD-friendly.

How dbt Seeds Work

1. Directory Structure

Place CSV files in the seeds/ directory at the project root (data/ in dbt versions before 1.0; the location is configurable via seed-paths in dbt_project.yml). dbt automatically discovers them. Each file name becomes the table name (minus the extension) unless overridden.
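
For example, with a hypothetical payment_methods.csv (a file name chosen purely for illustration), dbt seed creates a table named payment_methods:

# hypothetical layout — payment_methods.csv becomes the table payment_methods
my_dbt_project/
└── seeds/
    └── payment_methods.csv

# seeds/payment_methods.csv
id,method
1,card
2,bank_transfer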

2. YAML Configuration

Create or extend dbt_project.yml (or a properties .yml file in the seeds directory) with seeds: blocks to set the following options (a sketch follows the list):

  • schema – where the seed table lives (analytics, seed, etc.)
  • database – override database if necessary
  • quote_columns – quote column names (handy for mixed-case)
  • column_types – force warehouse-specific types
  • full_refresh – drop and recreate the table (needed when columns change) instead of the default truncate-and-insert
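
Here is a sketch of these options in dbt_project.yml; the project name my_dbt_project, the reference schema, and the column names are assumptions for illustration:

# dbt_project.yml (excerpt) — hypothetical project and schema names
seeds:
  my_dbt_project:
    +schema: reference          # seed tables land in a dedicated schema
    +quote_columns: true        # preserve mixed-case column names
    iso_countries:
      +column_types:
        alpha2: varchar(2)      # pin types rather than letting the adapter infer
        name: varchar(256)
      +full_refresh: true       # drop and recreate instead of truncate-and-insert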

3. Seeding Workflow

dbt seed creates or replaces tables using the warehouse adapter’s load path (batched INSERT statements on most adapters; some, such as BigQuery, use a native bulk-load job). Seeds respect project-level on-run-start and on-run-end hooks and appear in dbt docs like models, aiding lineage tracking.
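
In practice the workflow is driven from the command line. A few common invocations (the seed name iso_countries matches the example later in this article):

# load every seed in the project
dbt seed

# load a single seed
dbt seed --select iso_countries

# drop and recreate seed tables (needed after column changes)
dbt seed --full-refresh

# build seeds, models, and tests together in dependency order
dbt build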

Best Practices

  • Keep files small—under ~50 MB—so they load quickly in CI. Large static datasets belong in an external stage or source table.
  • Pin data by version; treat seeds as code. Updates should go through PR review so downstream changes are explicit.
  • Type explicitly with column_types to avoid adapter defaults that may differ across environments.
  • Limit schema churn; deleting or renaming seeds invalidates references. Use deprecation patterns when possible.

Common Mistakes and How to Fix Them

  1. Relying on inferred text types on Redshift. The adapter may default text columns to a short varchar and silently truncate longer values. Fix by pinning column_types explicitly; quote_columns: true only controls identifier quoting (handy for mixed-case names) and does not prevent truncation.
  2. Committing huge CSVs. Large binary diffs bloat the repo and slow CI. Store big files externally and load via external tables or stage + model.
  3. Using seeds for rapidly changing data. Every update requires a commit and a reseed. If the dataset updates more than weekly, treat it as a source, not a seed (see the sketch below).
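
For that third case, declaring the table as a source keeps lineage intact without a seed. A minimal sketch, assuming a hypothetical raw.exchange_rates table already loaded by your ELT pipeline:

# models/sources.yml
version: 2

sources:
  - name: raw
    tables:
      - name: exchange_rates

Models then reference it with {{ source('raw', 'exchange_rates') }} instead of ref().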

Practical End-to-End Example

Assume we need a country code look-up table.

# project layout
my_dbt_project/
├── seeds/
│   ├── iso_countries.csv
│   └── iso_countries.yml
└── models/
    └── my_model.sql

# seeds/iso_countries.yml
version: 2

seeds:
  - name: iso_countries
    config:
      schema: reference
      column_types:
        alpha2: varchar(2)
        name: varchar(256)

-- models/my_model.sql
SELECT
    o.order_id,
    c.name AS country_name
FROM {{ ref('orders') }} o
JOIN {{ ref('iso_countries') }} c
    ON o.ship_country_code = c.alpha2

Running dbt seed && dbt run will:

  1. Load iso_countries.csv into reference.iso_countries.
  2. Compile my_model, replacing {{ ref('iso_countries') }} with the fully-qualified seed table.
  3. Create lineage docs showing the dependency.
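
The compiled SQL (written under target/compiled/) would look roughly like the following; the analytics schema for the orders model is an assumption, and exact qualification depends on your adapter and schema-naming config:

-- target/compiled/my_dbt_project/models/my_model.sql (illustrative)
SELECT
    o.order_id,
    c.name AS country_name
FROM analytics.orders o
JOIN reference.iso_countries c
    ON o.ship_country_code = c.alpha2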

dbt Seed vs. Other Static-Data Approaches

Some teams keep look-ups in application code or spreadsheets. Those approaches break lineage, introduce dual sources of truth, and are not SQL-native. Others run bespoke ETL jobs that add maintenance overhead. dbt seeds centralize the data exactly where transformations occur and preserve version history.

Using Galaxy with dbt Seeds

Galaxy’s modern SQL editor recognizes seeded tables once they exist in your warehouse’s information schema. Auto-complete surfaces reference.iso_countries, and Galaxy’s AI copilot can explain or refactor queries that reference seeds. You still manage the seed files in Git and run dbt seed via CI, but day-to-day exploration and collaboration happen inside Galaxy without copy-pasting SQL across tools.

Conclusion

dbt seeds offer a lightweight yet powerful mechanism for storing small, rarely changing datasets in the warehouse under version control. By embracing seeds—and the best practices outlined here—you can keep transformations purely in SQL, ensure reproducibility, and simplify the mental model for everyone who touches your analytics stack.

Why dbt Seed Is Important

Static reference data is a critical dependency for many analytics pipelines. Without a standardized approach, teams resort to ad-hoc scripts or external spreadsheets—causing data drift and broken lineage. dbt seeds let you manage these datasets as code: version-controlled, testable, and loaded with one command. This tightens CI/CD loops, keeps transformation logic in SQL, and ensures that anyone running or reviewing a model can reproduce the full state of the database from the repository alone.

dbt Seed Example Usage


How do I configure column types for a dbt seed on Snowflake?
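
A hedged sketch using Snowflake type names; the project name, the seed name iso_countries, and the population column are assumptions for illustration:

# dbt_project.yml (excerpt) — Snowflake type names
seeds:
  my_dbt_project:
    iso_countries:
      +column_types:
        alpha2: varchar(2)        # ISO alpha-2 code
        name: string              # Snowflake alias for varchar
        population: number(38,0)  # hypothetical numeric column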

Frequently Asked Questions (FAQs)

What command loads dbt seeds?

Run dbt seed to load every CSV in the project’s seed directory into the warehouse, creating or replacing the corresponding tables. Use selectors like --select to limit which seeds run.

Can I reference a seed in a dbt model?

Yes. Treat a seed just like any model: {{ ref('my_seed_name') }}. dbt resolves the correct schema and table name during compilation.

How does Galaxy relate to dbt seeds?

Once seeds are loaded, Galaxy’s SQL editor auto-completes and documents those seed tables. While Galaxy doesn’t run dbt seed itself, it enhances exploration and collaboration on the resulting tables.

Should I use seeds for large datasets?

No. Seeds are best for small, infrequently changing reference data. Large or frequently updated datasets should be loaded via your primary ELT pipeline.
