
ETL vs ELT: A Practical Guide for Modern Data Teams


This resource explains the core concepts, architectures, and trade-offs of ETL and ELT. You’ll compare workflows, explore real-world scenarios, and follow a hands-on tutorial (with Galaxy examples) to decide which approach fits your data strategy.

Learning Objectives

  • Define ETL and ELT and understand their historical context.
  • Compare performance, cost, governance, and complexity trade-offs.
  • Identify real-world scenarios where each pattern shines.
  • Build a simple ETL pipeline and an ELT workflow with step-by-step code.
  • Use Galaxy to prototype, transform, and govern data in an ELT workflow.

1. Introduction

Modern data teams have more storage and compute options than ever. Two strategies dominate data movement and transformation:

  • ETL (Extract-Transform-Load) – clean and shape data before loading it into the target system.
  • ELT (Extract-Load-Transform) – load raw data first, then transform in-place inside the warehouse or lake.

Which one is right? The answer depends on scale, performance, cost, compliance, and team skill sets. Let’s dig in.

2. Foundational Concepts

2.1 Data Integration 101

Every analytics pipeline follows three high-level steps:

  1. Extract – copy data from a source (database, API, file) into a staging area.
  2. Transform – clean, join, enrich, and aggregate the data.
  3. Load – persist the finished data in its destination (warehouse, lake, operational DB).
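The three steps above can be sketched in a few lines of Python. This is a toy example: pandas and an in-memory SQLite database stand in for a real source and destination, and all table and column names are illustrative.

```python
import sqlite3

import pandas as pd

# 1. Extract: read source data into a staging DataFrame
#    (an inline DataFrame stands in for a database, API, or file)
raw = pd.DataFrame({
    "id": [1, 2, 3],
    "state": ["open", "closed", "open"],
    "created_at": ["2024-01-01", "2024-01-02", "2024-01-03"],
})

# 2. Transform: clean and shape the data before loading
clean = raw[raw["state"] == "open"].rename(columns={"created_at": "created_ts"})

# 3. Load: persist the finished table in the destination
conn = sqlite3.connect(":memory:")
clean.to_sql("issues", conn, index=False)
print(conn.execute("SELECT COUNT(*) FROM issues").fetchone()[0])  # 2 open issues
```

Swap the in-memory SQLite connection for a warehouse connection and the shape of the pipeline stays the same.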

2.2 What Is ETL?

In traditional ETL, the heavy lifting happens before data reaches the warehouse:

Source ➜ ETL server (transform) ➜ Data Warehouse

Pros:

  • Warehouse stores only curated, compact tables → lower storage cost.
  • Data governance enforced upstream → fewer raw PII copies.
  • Great fit when warehouse compute is scarce or expensive.

Cons:

  • Up-front schema design slows iteration.
  • Adding new fields or reprocessing history can be painful.
  • ETL servers often become brittle, black-box monoliths.

2.3 What Is ELT?

ELT flips the order, leaning on cheap cloud storage and elastic compute:

Source ➜ Data Warehouse (raw) ➜ SQL transforms

Pros:

  • Raw data preserved → easier replays and audit trails.
  • Transforms written in familiar SQL (or dbt) → faster iteration.
  • Separates orchestration (load) from modeling (transform).

Cons:

  • Warehouse storage costs may rise with raw tables.
  • Must implement governance/PII masking inside the warehouse.
  • Query performance can suffer if raw tables are misused directly.
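The ELT ordering — land the raw data untouched, then model it in-place with SQL — can be sketched with the same toy data. SQLite stands in for the warehouse here; all names are illustrative.

```python
import sqlite3

import pandas as pd

raw = pd.DataFrame({
    "id": [1, 2, 3],
    "state": ["open", "closed", "open"],
    "created_at": ["2024-01-01", "2024-01-02", "2024-01-03"],
})

conn = sqlite3.connect(":memory:")

# Load: land the raw data as-is, preserving it for replays and audits
raw.to_sql("raw_issues", conn, index=False)

# Transform: model the curated table in-place with SQL
conn.execute("""
    CREATE TABLE open_issues AS
    SELECT id, created_at AS created_ts
    FROM raw_issues
    WHERE state = 'open'
""")
```

Note that the raw table survives, so reprocessing is just re-running the SQL.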

3. ETL vs ELT: Head-to-Head Comparison

| Dimension        | ETL                            | ELT                                   |
|------------------|--------------------------------|---------------------------------------|
| Compute location | Dedicated ETL server           | Warehouse/lake engine                 |
| Iteration speed  | Slower (code + deploy)         | Faster (SQL only)                     |
| Data freshness   | Batch-oriented                 | Near-real-time possible               |
| Compliance       | Easier to drop PII before load | Requires in-warehouse controls        |
| Cost pattern     | More compute outside warehouse | Higher storage, pay-as-you-go compute |

4. When to Choose ETL

  • Your warehouse compute is limited or priced aggressively per CPU hour.
  • Regulations demand that sensitive columns never hit the warehouse.
  • You have a mature ETL platform (Informatica, Talend) and trained staff.
  • Data volumes are modest and schema changes infrequent.

5. When to Choose ELT

  • You run Snowflake, BigQuery, Redshift, Databricks, or DuckDB—engines designed for in-warehouse transforms.
  • Your team prefers SQL and version control (dbt, Galaxy).
  • You need rapid experimentation, data lineage, and easy backfills.
  • Storage is cheap relative to engineering time.

6. Architectures & Tooling

6.1 Typical ETL Stack

Sources → Apache NiFi → Python/Java transforms → S3 Staging → Redshift

6.2 Typical ELT Stack

Sources → Fivetran/Airbyte (raw load) → Snowflake raw schema → dbt models → BI layer (Looker/Galaxy soon)

6.3 Where Galaxy Fits

Galaxy isn’t a data mover, but it supercharges the T in ELT:

  • Connect Galaxy to Snowflake or Postgres.
  • Write, test, and endorse transformation queries in Collections.
  • Share links so product managers run trusted models without editing them.
  • Use Galaxy’s AI Copilot to refactor, document, and optimize SQL—perfect for dbt refactor days.

7. Hands-On Tutorial

Let’s build two mini-pipelines that load GitHub issue data.

7.1 Dataset Prep

Download the sample CSV as issues.csv:

curl -o issues.csv https://raw.githubusercontent.com/gh-data/issues_sample/main/issues.csv

7.2 ETL Example (Python + psycopg2)

import pandas as pd
from sqlalchemy import create_engine  # pandas.to_sql needs a SQLAlchemy engine

# 1. Extract
issues = pd.read_csv('issues.csv')

# 2. Transform: keep only open issues, rename columns
issues = issues[issues['state'] == 'open']
issues = issues.rename(columns={'created_at': 'created_ts'})

# 3. Load into Postgres (psycopg2 is the underlying driver)
engine = create_engine('postgresql+psycopg2://user:pwd@localhost:5432/analytics')
issues.to_sql('github_issues', engine, if_exists='replace', index=False)

Notice we had to know every transformation before loading.

7.3 ELT Example (Airbyte + Galaxy)

  1. Extract & Load – Airbyte syncs the source into the Postgres raw schema as _airbyte_raw_issues.
  2. Transform in Galaxy:

-- File: transforms/open_issues.sql
-- (Postgres has no CREATE OR REPLACE TABLE, so drop and recreate)
DROP TABLE IF EXISTS analytics.github_issues;
CREATE TABLE analytics.github_issues AS
SELECT id,
       title,
       body,
       created_at AS created_ts
FROM raw._airbyte_raw_issues
WHERE state = 'open';

Save the query in a Collection ➜ "GitHub Models" and click Endorse so teammates trust it.

Incremental Updates

Add a simple INSERT for new rows:

INSERT INTO analytics.github_issues
SELECT r.id, r.title, r.body, r.created_at
FROM raw._airbyte_raw_issues r
LEFT JOIN analytics.github_issues a USING (id)
WHERE a.id IS NULL AND r.state = 'open';

Galaxy’s AI Copilot can generate this join automatically—try the ⌘⇧A shortcut.

7.4 Testing Your Model

Within Galaxy, run:

SELECT COUNT(*) FROM analytics.github_issues;

If the count is zero, check Airbyte sync logs or column names. Galaxy highlights error lines and offers fixes.

8. Common Pitfalls & Troubleshooting

8.1 Silent Type Mismatches

Warehouse engines may silently coerce strings to numbers or dates. Guard against this with explicit casts, ASSERT statements (where the engine supports them, e.g. BigQuery), or dbt tests.

8.2 PII Leakage in ELT

Create restricted schemas (e.g. raw_sensitive) with role-based access, and mask sensitive columns behind views (for example, Snowflake secure views).

8.3 Long-Running Transforms

  • In ETL: parallelize with Spark clusters.
  • In ELT: use warehouse clustering/partition keys and MERGE + incremental logic.
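The incremental logic mentioned above is the same anti-join pattern as the INSERT in section 7.3, and it can be prototyped locally before pointing it at a warehouse. SQLite stands in for the warehouse here; table names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_issues (id INTEGER, state TEXT);
    CREATE TABLE github_issues (id INTEGER);
    INSERT INTO raw_issues VALUES (1, 'open'), (2, 'open'), (3, 'closed');
    INSERT INTO github_issues VALUES (1);  -- row already loaded
""")

# Anti-join: insert only open rows not yet in the target table,
# so re-running the transform never reprocesses history
conn.execute("""
    INSERT INTO github_issues
    SELECT r.id
    FROM raw_issues r
    LEFT JOIN github_issues a ON r.id = a.id
    WHERE a.id IS NULL AND r.state = 'open'
""")
print(conn.execute("SELECT COUNT(*) FROM github_issues").fetchone()[0])  # 2
```

Because only the missing rows are inserted, the statement is safe to re-run on every sync.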

8.4 Version Drift

Whether ETL or ELT, store code in Git. Galaxy offers GitHub sync so every endorsed SQL file is versioned.

9. Key Takeaways

  • ETL pushes transforms upstream; ELT leverages warehouse compute.
  • ELT accelerates iteration but demands in-warehouse governance.
  • Cloud economics (cheap storage, elastic compute) make ELT the default for most startups.
  • Galaxy complements ELT by making SQL transforms discoverable, versioned, and AI-assisted.

10. Next Steps

  1. Connect Galaxy to your analytics warehouse.
  2. Load a raw dataset with Airbyte or Fivetran.
  3. Write three transformation queries and save them in a Collection.
  4. Invite a teammate as Viewer and ask for feedback.
  5. Optional: Refactor the SQL into dbt models and automate nightly runs.
