Ingesting Data from Operational Databases into Modern Cloud Warehouses

Galaxy Glossary

How do you efficiently move data from Postgres, MySQL, or SQL Server into ClickHouse, Snowflake, BigQuery, and Redshift?

Moving data continuously and reliably from transactional databases like Postgres, MySQL, or SQL Server into analytics warehouses such as ClickHouse, Snowflake, BigQuery, or Redshift using batch or streaming ELT pipelines.


Description

Overview

Data ingestion is the process of extracting data from source systems, transforming it as needed, and loading it into a destination system where it can be analyzed. When the source is an operational OLTP database (Postgres, MySQL, SQL Server, etc.) and the destination is an analytical data warehouse (ClickHouse, Snowflake, BigQuery, Redshift), the ingestion pipeline must balance freshness, reliability, cost, and scalability.

Why Move Data from OLTP to Warehouse?

Separation of Workloads

Transaction processing databases are optimized for fast, row-level updates and short queries. Warehouses use columnar storage, massively parallel processing (MPP), and cheap object storage to power large aggregations and ad-hoc analysis. Off-loading reporting queries prevents performance hits on production systems.

Advanced Analytics

Warehouses support complex joins, window functions, vectorized execution, ML integrations, and petabyte-scale data. These capabilities unlock deeper insights than are practical in an OLTP system.

Cost Efficiency

Running BI workloads against a warehouse is typically cheaper because compute and storage scale independently and cold data can sit in low-cost object storage.

How Ingestion Pipelines Work

1. Extract

Different techniques exist for exporting changes from a database:

  • Logical replication / Change Data Capture (CDC). Tools like Debezium, AWS DMS, or Azure Data Factory read the write-ahead log (WAL) or binary log to stream inserts, updates, and deletes.
  • Incremental timestamp queries. Periodic jobs (cron, Airflow, dbt Cloud) query WHERE updated_at > last_run (a minimal sketch follows this list). Simple, but this approach can miss late-arriving data and adds load to the source.
  • Full extracts. Entire tables dumped to CSV/Parquet via pg_dump, mysqldump, or native export commands; suitable for small tables or nightly loads.
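As a sketch of the incremental-timestamp approach mentioned above, a scheduled job might run a query like the one below. The orders table, updated_at column, and the last_run_at bookmark are illustrative; in practice the bookmark would come from the orchestrator's state.

-- Illustrative incremental extract: pull only rows changed since the last successful run.
-- :last_run_at is a bookmark stored by the scheduler or a small state table.
SELECT *
FROM public.orders
WHERE updated_at > :last_run_at
ORDER BY updated_at;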

2. Stage

Most warehouses ingest files staged in object storage:

  • Snowflake: internal stage or S3/GCS/Azure external stage.
  • Redshift: S3.
  • BigQuery: GCS buckets.
  • ClickHouse: S3, HDFS, or direct native protocol.

Staging allows parallel loading, retry logic, and cost-effective archival.
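As one example of working with staged files, ClickHouse can query Parquet objects in S3 directly with its s3 table function before (or instead of) loading them. The bucket path below is hypothetical, and credentials are assumed to be configured on the server.

-- Inspect staged Parquet files from ClickHouse
SELECT count()
FROM s3('https://my-bucket.s3.amazonaws.com/orders/*.parquet', 'Parquet');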

3. Load & Transform

The load step copies staged files into raw tables, often followed by ELT (transformations inside the warehouse using SQL or dbt):

  • Snowflake: COPY INTO my_raw_table
  • Redshift: COPY my_raw_table FROM 's3://...'
  • BigQuery: LOAD DATA or CREATE TABLE AS SELECT
  • ClickHouse: INSERT INTO my_raw_table FORMAT Parquet

After landing the raw data, dbt models or materialized views clean, join, and aggregate the data into analytics-ready schemas.
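A minimal transformation on top of the landed raw data might look like the following; the table and column names are illustrative, and in practice this SELECT would usually live in a dbt model rather than a hand-run statement.

-- Illustrative ELT transform: aggregate raw orders into a daily revenue table
CREATE OR REPLACE TABLE analytics.daily_revenue AS
SELECT
    DATE_TRUNC('day', created_at) AS order_date,
    COUNT(*)                      AS order_count,
    SUM(amount)                   AS revenue
FROM raw.public.orders
GROUP BY 1;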

Pipeline Patterns

Batch ELT (Hourly / Daily)

Simple Airflow or cron jobs export data to files, upload to object storage, then run COPY. Good for lower-volume or latency-tolerant use cases.
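For example, a nightly batch job targeting Redshift might end with a COPY similar to the one below; the bucket path and IAM role ARN are placeholders.

-- Nightly batch load of exported Parquet files into Redshift
COPY raw_orders
FROM 's3://my-bucket/exports/orders/2024-01-01/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS PARQUET;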

Micro-Batch Streaming

Kafka Connect + Debezium streams CDC events to Kafka, then a Spark/Flink job writes every few minutes to Parquet in S3. Warehouses auto-ingest as files arrive.
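On the warehouse side, "auto-ingest as files arrive" can be implemented with Snowflake's Snowpipe, for instance. This is a sketch only: the pipe, table, and stage names are illustrative, and AUTO_INGEST assumes event notifications are configured on the bucket.

-- Sketch: Snowpipe continuously loads new Parquet files as they land in the stage
CREATE OR REPLACE PIPE raw.public.orders_pipe
  AUTO_INGEST = TRUE
  AS
  COPY INTO raw.public.orders
  FROM @raw_s3_stage/orders/
  FILE_FORMAT = (TYPE = 'PARQUET')
  MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;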

Real-Time Replication

SaaS tools (Fivetran, Hevo, Stitch) or open-source (Airbyte) stream CDC directly into the warehouse with second-level latency.

Example End-to-End Flow (Postgres ➜ Snowflake)

  1. Enable logical replication on Postgres (a minimal configuration sketch follows this list).
  2. Run Airbyte’s Postgres source connector configured for CDC.
  3. Airbyte writes Parquet files to the Snowflake stage @cdc_stage.
  4. Snowflake PIPE auto-ingests into raw_postgres_mytable.
  5. dbt models transform raw data into the dimensional public.mytable.
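Step 1 above boils down to a few settings on the Postgres side. The publication and slot names below are illustrative; the exact names depend on how the CDC connector is configured.

-- On the Postgres source (changing wal_level requires a restart)
ALTER SYSTEM SET wal_level = 'logical';

-- Publication and replication slot for the CDC connector (names are illustrative)
CREATE PUBLICATION airbyte_pub FOR TABLE public.mytable;
SELECT pg_create_logical_replication_slot('airbyte_slot', 'pgoutput');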

Best Practices

Design for Idempotency

Loads should be re-runnable without duplicate rows. Use primary keys and MERGE operations or deduplicate staging tables.
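For instance, a staging table can be deduplicated on its primary key before merging. The sketch below uses Snowflake's QUALIFY clause, and _extracted_at is an assumed ingestion-timestamp column.

-- Keep only the latest version of each row before the MERGE step
CREATE OR REPLACE TABLE staging.orders_dedup AS
SELECT *
FROM staging.orders
QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY _extracted_at DESC) = 1;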

Track Metadata

Persist load timestamps, job versions, and source LSN/SCN positions for auditing and incremental logic.
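A simple way to do this is a small audit table that every load job writes to; the table name and columns below are a hypothetical layout.

-- Hypothetical load-audit table capturing per-run metadata
CREATE TABLE IF NOT EXISTS ops.load_audit (
    pipeline_name   VARCHAR,
    loaded_at       TIMESTAMP,
    job_version     VARCHAR,
    source_position VARCHAR    -- e.g. a WAL LSN or SCN from the source
);

INSERT INTO ops.load_audit
VALUES ('orders_cdc', CURRENT_TIMESTAMP, 'v1.4.2', '0/1A2B3C4D');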

Optimize File Sizes

Warehouses prefer large (100–250 MB) compressed files. Too many small files hurt performance and drive up cost.

Secure Data in Transit and at Rest

Use IAM roles, KMS encryption, VPC peering or PrivateLink, and rotate credentials regularly.

Monitor & Alert

Track lag, error rates, and row counts. Compare warehouse row counts to source counts to detect drift.
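The warehouse side of a drift check can be as simple as the query below, compared against the same count run on the source database; _loaded_at is an assumed ingestion-timestamp column.

-- Warehouse-side figures for a source-vs-warehouse drift check
SELECT COUNT(*)        AS warehouse_row_count,
       MAX(_loaded_at) AS latest_load_time
FROM analytics.orders;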

Common Pitfalls to Avoid

Misaligned Data Types

For example, MySQL TINYINT(1) may map to BOOLEAN in some loaders but to INTEGER in others, causing downstream errors. Always review schema mappings.
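One defensive option is to make the mapping explicit in the transformation layer; the table and column names here are illustrative.

-- Normalize a MySQL TINYINT(1) flag into a real boolean during transformation
SELECT order_id,
       (is_gift = 1) AS is_gift
FROM raw.mysql_orders;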

Ignoring Soft Deletes

If a table uses a deleted_at flag instead of physical deletes, CDC pipelines won’t see deletes. Incorporate soft-delete logic in transformations.
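In practice this often means filtering (or flagging) soft-deleted rows in downstream models, as in this sketch; deleted_at is assumed to be the source's soft-delete column.

-- Exclude soft-deleted rows when building analytics models
SELECT *
FROM raw.public.orders
WHERE deleted_at IS NULL;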

Long-Running Transactions

Long-running or very large Postgres transactions force the server to retain WAL segments, ballooning disk usage and stalling logical replication. Keep OLTP transactions short, or replicate from a replica.
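To catch this early, the WAL retained by each replication slot can be monitored on the source with a query like the following, using standard Postgres system views and functions.

-- On Postgres: WAL retained by each replication slot (large values indicate stalled consumers)
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;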

Working SQL Example

-- Stage: Snowflake external stage pointing to S3
CREATE OR REPLACE STAGE raw_s3_stage
url='s3://my-postgres-dumps/'
storage_integration = my_s3_int
file_format = (type = 'PARQUET');

-- Load: copy new files incrementally
-- (Snowflake's load metadata lets COPY INTO skip files it has already loaded,
--  so no separate file-tracking table is required)
COPY INTO raw.public.orders
FROM @raw_s3_stage/orders/
PATTERN = '.*[.]parquet'
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE;

-- Audit which files were loaded via the copy history
SELECT file_name, last_load_time, row_count
FROM TABLE(information_schema.copy_history(
    table_name => 'raw.public.orders',
    start_time => DATEADD(hour, -24, CURRENT_TIMESTAMP())
));

-- Transform: upsert into the analytics schema
-- (column names are illustrative; Snowflake MERGE requires explicit column lists)
MERGE INTO analytics.orders AS tgt
USING (
    SELECT id, customer_id, status, updated_at
    FROM raw.public.orders
    QUALIFY ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) = 1
) AS src
ON tgt.id = src.id
WHEN MATCHED THEN UPDATE SET
    tgt.customer_id = src.customer_id,
    tgt.status      = src.status,
    tgt.updated_at  = src.updated_at
WHEN NOT MATCHED THEN INSERT (id, customer_id, status, updated_at)
    VALUES (src.id, src.customer_id, src.status, src.updated_at);

Galaxy and Ingestion Workflows

Galaxy is a developer-oriented SQL editor. While it doesn’t run ingestion pipelines itself, it shines in:

  • Authoring ELT SQL. Use Galaxy’s AI Copilot to generate MERGE or COPY statements quickly.
  • Debugging Loads. Query staging tables, inspect file metadata, and share diagnosis queries with teammates via Galaxy Collections.
  • Documenting Pipelines. Endorse tested SQL so future engineers can reuse trusted transformations.

Conclusion

Moving data from operational databases into modern warehouses is the backbone of contemporary analytics stacks. By choosing the right extraction strategy, staging efficiently, and following ELT best practices, teams can deliver fresh, reliable data for dashboards, forecasting models, and data-driven products—all while keeping source systems snappy. With tooling like Galaxy for collaborative SQL development and AI-assisted query writing, maintaining these pipelines becomes faster and less error-prone.

Why Ingesting Data from Operational Databases into Modern Cloud Warehouses is important

Analytics teams rely on current, accurate data to power dashboards, experimentation, and machine learning. If the ingestion pipeline is slow, brittle, or lossy, insights are delayed and decisions suffer. Understanding ingestion mechanics—CDC, staging, ELT—enables engineers to build resilient, cost-effective pipelines that scale with business growth.

Ingesting Data from Operational Databases into Modern Cloud Warehouses Example Usage


-- Freshness check: rows loaded into the warehouse within the last five minutes
SELECT COUNT(*) FROM analytics.orders WHERE _loaded_at > CURRENT_TIMESTAMP - INTERVAL '5 minutes';


Frequently Asked Questions (FAQs)

What is Change Data Capture (CDC)?

CDC reads the database’s transaction log to capture every insert, update, and delete. It enables near real-time replication with minimal load on the source database.

Do I need to transform data before loading?

Not necessarily. Modern ELT philosophy recommends loading raw data first, then transforming inside the warehouse using SQL or dbt. This preserves lineage and simplifies debugging.

How does Galaxy help with ingestion workflows?

Galaxy’s AI Copilot autocompletes COPY and MERGE statements, while Collections let teams share validated ELT queries. Although Galaxy doesn’t orchestrate jobs, it accelerates writing and reviewing pipeline SQL.

Should I choose batch or streaming ingestion?

Choose based on latency, volume, and complexity. Streaming (CDC) is ideal for sub-minute freshness, but batch may suffice for daily dashboards and is cheaper to operate.
