Moving data continuously and reliably from transactional databases like Postgres, MySQL, or SQL Server into analytics warehouses such as ClickHouse, Snowflake, BigQuery, or Redshift using batch or streaming ELT pipelines.
Data ingestion is the process of extracting data from source systems, transforming it as needed, and loading it into a destination system where it can be analyzed. When the source is an operational OLTP database (Postgres, MySQL, SQL Server, etc.) and the destination is an analytical data warehouse (ClickHouse, Snowflake, BigQuery, Redshift), the ingestion pipeline must balance freshness, reliability, cost, and scalability.
Transaction processing databases are optimized for fast, row-level updates and short queries. Warehouses use columnar storage, massively parallel processing (MPP), and cheap object storage to power large aggregations and ad-hoc analysis. Off-loading reporting queries prevents performance hits on production systems.
Warehouses support complex joins, window functions, vectorized execution, ML integrations, and petabyte-scale data. These capabilities unlock deeper insights than are practical in an OLTP system.
Running BI workloads against a warehouse is typically cheaper because compute and storage scale independently and cold data can sit in low-cost object storage.
Different techniques exist for exporting changes from a database:
- Incremental queries: filter on a timestamp watermark, e.g. WHERE updated_at > last_run. Simple, but it can miss late-arriving data and adds query load to the source (see the sketch after this list).
- Full dumps: pg_dump, mysqldump, or native export commands; suitable for small tables or nightly loads.
- Log-based change data capture (CDC): read the transaction log (Postgres WAL, MySQL binlog) to capture every insert, update, and delete with minimal impact on the source.
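A minimal sketch of the incremental approach, assuming a source table public.orders with an updated_at column and a watermark value supplied by the orchestrator (both names are illustrative):
-- Extract only rows changed since the last successful run.
-- :last_run is a bind parameter set by the scheduler.
SELECT *
FROM public.orders
WHERE updated_at > :last_run
ORDER BY updated_at;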
Most warehouses ingest files staged in object storage such as Amazon S3. Staging allows parallel loading, retry logic, and cost-effective archival.
The load step copies staged files into raw tables, often followed by ELT (transformations inside the warehouse using SQL or dbt):
- Snowflake: COPY INTO my_raw_table
- Redshift: COPY my_raw_table FROM 's3://...' (a fuller sketch follows this list)
- BigQuery: LOAD DATA or CREATE TABLE AS SELECT
- ClickHouse: INSERT INTO my_raw_table FORMAT Parquet
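For instance, a Redshift load of staged Parquet files might look like the following; the bucket path and IAM role ARN are placeholders:
-- Load staged Parquet files from S3 into a raw Redshift table.
COPY my_raw_table
FROM 's3://my-bucket/orders/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
FORMAT AS PARQUET;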
After landing the raw data, dbt models or materialized views clean, join, and aggregate the data into analytics-ready schemas.
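As one illustration, an analytics-ready model might aggregate the raw landing table into daily order metrics; the column names below are assumptions rather than part of any schema shown here:
-- Daily order metrics per customer, built on the raw landing table.
SELECT
    customer_id,
    DATE_TRUNC('day', created_at) AS order_date,
    COUNT(*)                      AS order_count,
    SUM(total_amount)             AS total_revenue
FROM raw.public.orders
GROUP BY customer_id, DATE_TRUNC('day', created_at);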
Simple Airflow or cron jobs export data to files, upload to object storage, then run COPY. Good for lower-volume or latency-tolerant use cases.
Kafka Connect + Debezium streams CDC events to Kafka, then a Spark/Flink job writes every few minutes to Parquet in S3. Warehouses auto-ingest as files arrive.
SaaS tools (Fivetran, Hevo, Stitch) or open-source tools (Airbyte) stream CDC directly into the warehouse with second-level latency.
For example, CDC files land in the Snowflake stage @cdc_stage, and a PIPE auto-ingests them into raw_postgres_mytable, which mirrors the source table public.mytable. A hedged sketch of such a pipe follows.
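A minimal Snowpipe definition for that pattern might look like this, assuming @cdc_stage is an external stage with event notifications configured; the path and file format are assumptions:
-- Continuously load new CDC files from the stage into the raw table.
CREATE OR REPLACE PIPE raw_postgres_mytable_pipe
  AUTO_INGEST = TRUE
AS
  COPY INTO raw_postgres_mytable
  FROM @cdc_stage/mytable/
  FILE_FORMAT = (TYPE = 'PARQUET')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;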
Loads should be re-runnable without duplicate rows. Use primary keys and MERGE operations, or deduplicate staging tables before inserting (see the sketch below).
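One way to deduplicate a staging table before merging, assuming a staging_orders table keyed by id with an updated_at column (both names illustrative), shown here in Snowflake-style SQL:
-- Keep only the most recent change event per primary key.
CREATE OR REPLACE TEMPORARY TABLE staging_orders_dedup AS
SELECT *
FROM staging_orders
QUALIFY ROW_NUMBER() OVER (
    PARTITION BY id
    ORDER BY updated_at DESC
) = 1;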
Persist load timestamps, job versions, and source LSN/SCN positions for auditing and incremental logic.
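A lightweight way to do this is a pipeline-maintained audit table; the shape below is an assumption, not a standard schema:
-- Hypothetical audit table populated by the pipeline after each load.
CREATE TABLE IF NOT EXISTS raw.load_audit (
    table_name      VARCHAR,
    loaded_at       TIMESTAMP,
    job_version     VARCHAR,
    source_position VARCHAR,  -- e.g. Postgres LSN or Oracle SCN
    row_count       BIGINT
);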
Warehouses prefer large (100–250 MB) compressed files. Too many small files hurt performance and drive up cost.
Use IAM roles, KMS encryption, VPC peering or PrivateLink, and rotate credentials regularly.
Track lag, error rates, and row counts. Compare warehouse row counts to source counts to detect drift.
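Building on the hypothetical raw.load_audit table sketched above, a basic drift check could compare the most recently recorded source count with the warehouse table:
-- Compare the row count recorded at load time with the current warehouse count.
SELECT
    a.row_count AS source_row_count,
    (SELECT COUNT(*) FROM analytics.orders) AS warehouse_row_count
FROM raw.load_audit AS a
WHERE a.table_name = 'orders'
ORDER BY a.loaded_at DESC
LIMIT 1;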
For example, MySQL TINYINT(1) may map to BOOLEAN in some loaders but to INTEGER in others, causing downstream errors. Always review schema mappings.
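One hedged mitigation is to pin down the ambiguous type explicitly in the transformation layer; the table and column names here are assumptions:
-- Normalize a column loaded from MySQL TINYINT(1) so downstream models always see a boolean.
SELECT
    id,
    CAST(is_active AS BOOLEAN) AS is_active
FROM raw.mysql_users;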
If a table uses a deleted_at flag instead of physical deletes, CDC pipelines won't see deletes. Incorporate soft-delete logic in transformations.
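In practice that usually means filtering soft-deleted rows in the model layer, assuming the source exposes a deleted_at column as described above:
-- Exclude soft-deleted rows from analytics models.
SELECT *
FROM raw.public.orders
WHERE deleted_at IS NULL;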
Large Postgres transactions retain WAL segments, ballooning disk usage and stalling replication. Keep OLTP transactions short or replicate from a replica.
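On the Postgres side, one way to watch for this is to check how much WAL each replication slot is retaining (standard system views, no extensions assumed):
-- Large values mean WAL segments are being retained on disk for that slot.
SELECT
    slot_name,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS retained_wal
FROM pg_replication_slots;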
The end-to-end Snowflake example below ties these pieces together.
-- Stage: Snowflake external stage pointing to S3
CREATE OR REPLACE STAGE raw_s3_stage
url='s3://my-postgres-dumps/'
storage_integration = my_s3_int
file_format = (type = 'PARQUET');
-- Load: copy new files incrementally. Snowflake's load metadata skips files
-- already loaded, so re-running this COPY is safe. (The FILES option accepts
-- only a literal list of file names, so an orchestrator would supply it from
-- raw.file_tracker if finer control is needed.)
COPY INTO raw.public.orders
FROM @raw_s3_stage/orders/
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;
-- Mark files as loaded
UPDATE raw.file_tracker SET loaded = TRUE WHERE filename IN (...);
-- Transform: upsert into the analytics schema. Snowflake MERGE requires
-- explicit column lists (SET * / INSERT * are not supported); the columns
-- below are illustrative.
MERGE INTO analytics.orders AS tgt
USING (
    SELECT * FROM raw.public.orders
) AS src
ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET
    status     = src.status,
    amount     = src.amount,
    updated_at = src.updated_at
WHEN NOT MATCHED THEN INSERT (id, status, amount, updated_at)
    VALUES (src.id, src.status, src.amount, src.updated_at);
Galaxy is a developer-oriented SQL editor. While it doesn't run ingestion pipelines itself, it shines at drafting, validating, and sharing the MERGE and COPY statements these pipelines depend on.
Moving data from operational databases into modern warehouses is the backbone of contemporary analytics stacks. By choosing the right extraction strategy, staging efficiently, and following ELT best practices, teams can deliver fresh, reliable data for dashboards, forecasting models, and data-driven products, all while keeping source systems snappy. With tooling like Galaxy for collaborative SQL development and AI-assisted query writing, maintaining these pipelines becomes faster and less error-prone.
Analytics teams rely on current, accurate data to power dashboards, experimentation, and machine learning. If the ingestion pipeline is slow, brittle, or lossy, insights are delayed and decisions suffer. Understanding ingestion mechanics—CDC, staging, ELT—enables engineers to build resilient, cost-effective pipelines that scale with business growth.
CDC reads the database’s transaction log to capture every insert, update, and delete. It enables near real-time replication with minimal load on the source database.
Not necessarily. Modern ELT philosophy recommends loading raw data first, then transforming inside the warehouse using SQL or dbt. This preserves lineage and simplifies debugging.
Galaxy’s AI Copilot autocompletes COPY and MERGE statements, while Collections let teams share validated ELT queries. Although Galaxy doesn’t orchestrate jobs, it accelerates writing and reviewing pipeline SQL.
Choose based on latency, volume, and complexity. Streaming (CDC) is ideal for sub-minute freshness, but batch may suffice for daily dashboards and is cheaper to operate.