This resource explains the core concepts, architectures, and trade-offs of ETL and ELT. You’ll compare workflows, explore real-world scenarios, and follow a hands-on tutorial (with Galaxy examples) to decide which approach fits your data strategy.
Modern data teams have more storage and compute options than ever. Two strategies dominate data movement and transformation:

- ETL (Extract, Transform, Load): transform data before it lands in the warehouse.
- ELT (Extract, Load, Transform): load raw data first, then transform it inside the warehouse.
Which one is right? The answer depends on scale, performance, cost, compliance, and team skill sets. Let’s dig in.
Every analytics pipeline follows three high-level steps:

1. Extract data from source systems (databases, APIs, files).
2. Transform it into an analysis-ready shape (clean, join, aggregate).
3. Load it into a destination such as a warehouse or lake.

ETL and ELT differ only in where and when the transform happens.
In traditional ETL, the heavy lifting happens before data reaches the warehouse:
Source ➜ ETL server (transform) ➜ Data Warehouse
Pros:

- Only clean, conformed data ever reaches the warehouse.
- Sensitive fields can be dropped or masked before load, which eases compliance.
- Warehouse storage stays lean because raw data never lands there.
Cons:

- Requires a dedicated ETL server and engineering work outside the warehouse.
- Iteration is slow: every change means new code plus a redeploy.
- Batch-oriented, so data freshness lags.
ELT flips the order, leaning on cheap cloud storage and elastic compute:
Source ➜ Data Warehouse (raw) ➜ SQL transforms
Pros:

- Transformations are plain SQL, so iteration is fast.
- Raw data is preserved and can be re-transformed as requirements change.
- Near-real-time freshness is possible with frequent loads.
- Compute is elastic and pay-as-you-go.
Cons:

- Raw (possibly sensitive) data lands in the warehouse, so compliance controls must live there.
- Storage costs are higher, and careless transforms can run up the compute bill.
| Dimension | ETL | ELT |
| --- | --- | --- |
| Compute location | Dedicated ETL server | Warehouse/lake engine |
| Iteration speed | Slower (code + deploy) | Faster (SQL only) |
| Data freshness | Batch-oriented | Near-real-time possible |
| Compliance | Easier to drop PII before load | Requires in-warehouse controls |
| Cost pattern | More compute outside warehouse | Higher storage, pay-as-you-go compute |
A classic ETL stack: Sources → Apache NiFi → Python/Java transforms → S3 staging → Redshift
A modern ELT stack: Sources → Fivetran/Airbyte (raw load) → Snowflake raw schema → dbt models → BI layer (Looker/Galaxy soon)
Galaxy isn’t a data mover, but it supercharges the T in ELT:

- A fast SQL editor with an AI Copilot that helps write and optimize transforms.
- Collections and endorsements, so teammates reuse vetted transform queries.
- GitHub sync, so endorsed SQL files stay versioned.
Let’s build two mini-pipelines that load GitHub issue data: one classic ETL, one modern ELT.
Download the public sample CSV to data/issues.csv:
mkdir -p data && curl -o data/issues.csv https://raw.githubusercontent.com/gh-data/issues_sample/main/issues.csv
import pandas as pd
from sqlalchemy import create_engine

# 1. Extract
issues = pd.read_csv('data/issues.csv')

# 2. Transform: keep only open issues, rename columns
issues = issues[issues['state'] == 'open']
issues = issues.rename(columns={'created_at': 'created_ts'})

# 3. Load into Postgres (DataFrame.to_sql needs a SQLAlchemy engine, not a raw psycopg2 connection)
engine = create_engine('postgresql://user:pwd@localhost:5432/analytics')
issues.to_sql('github_issues', engine, if_exists='replace', index=False)
Notice that we had to know every transformation up front; changing our minds later means editing the script and reloading everything.
This time an EL tool lands the data untouched (Airbyte source → Postgres raw schema: _airbyte_raw_issues), and the transform happens afterwards in SQL:
-- File: transforms/open_issues.sql
-- Postgres has no CREATE OR REPLACE TABLE, so drop and recreate
DROP TABLE IF EXISTS analytics.github_issues;
CREATE TABLE analytics.github_issues AS
SELECT id,
       title,
       body,
       created_at AS created_ts
FROM raw._airbyte_raw_issues
WHERE state = 'open';
Save the query in a Collection ➜ "GitHub Models" and click Endorse so teammates trust it.
Add a simple INSERT for new rows:
INSERT INTO analytics.github_issues
SELECT r.id, r.title, r.body, r.created_at  -- qualify with r: title and body exist in both tables
FROM raw._airbyte_raw_issues r
LEFT JOIN analytics.github_issues a USING (id)
WHERE a.id IS NULL
  AND r.state = 'open';
Galaxy’s AI Copilot can generate this join automatically—try the ⌘⇧A shortcut.
Within Galaxy, run:
SELECT COUNT(*) FROM analytics.github_issues;
If the count is zero, check Airbyte sync logs or column names. Galaxy highlights error lines and offers fixes.
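To inspect the column names, a quick query against information_schema works in Postgres; the schema and table names below follow the tutorial above:

SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'raw'
  AND table_name = '_airbyte_raw_issues';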
Warehouse engines may silently coerce strings to unexpected types; guard against this with ASSERT statements or dbt tests.
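A dbt-style test is simply a query that should return zero rows. Here is a minimal sketch against the tutorial's table:

-- Any row returned is a failure
SELECT id
FROM analytics.github_issues
WHERE id IS NULL
   OR created_ts IS NULL;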
Create secure schemas (e.g. raw_sensitive) with role-based access, and mask sensitive columns with a SECURE VIEW.
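For example, a minimal sketch in Snowflake syntax (the raw_sensitive.users table and its email column are hypothetical):

CREATE SECURE VIEW analytics.users_masked AS
SELECT id,
       -- Hide the local part of the address
       REGEXP_REPLACE(email, '.+@', '***@') AS email_masked
FROM raw_sensitive.users;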
Deduplicate and backfill with MERGE plus incremental logic; a sketch follows.
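A minimal sketch against the tutorial's tables, assuming Postgres 15+ (which added MERGE; earlier versions can emulate it with INSERT ... ON CONFLICT):

MERGE INTO analytics.github_issues a
USING raw._airbyte_raw_issues r
  ON a.id = r.id
WHEN MATCHED THEN
  -- Refresh rows that already exist
  UPDATE SET title = r.title, body = r.body
WHEN NOT MATCHED AND r.state = 'open' THEN
  -- Add new open issues
  INSERT (id, title, body, created_ts)
  VALUES (r.id, r.title, r.body, r.created_at);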
Whether ETL or ELT, store code in Git. Galaxy offers GitHub sync so every endorsed SQL file is versioned.