Self-hosting Redshift means recreating Redshift’s MPP, columnar analytics on infrastructure you control using open-source PostgreSQL and compatible extensions.
Keep data on-prem, avoid SaaS lock-in, cut long-term costs, and gain full control over versioning and security.
Combine vanilla PostgreSQL with Citus for sharding, Citus's built-in columnar storage (the successor to the cstore_fdw extension) for column-oriented tables, and PgBouncer for connection pooling. Deploy one coordinator and multiple worker nodes.
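PgBouncer itself needs only a small configuration file. A minimal sketch, assuming a database named analytics and a coordinator host called coordinator01 (both placeholders):

; /etc/pgbouncer/pgbouncer.ini -- illustrative values
[databases]
analytics = host=coordinator01 port=5432 dbname=analytics

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = scram-sha-256
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 500
default_pool_size = 20

Transaction pooling keeps dashboards and BI tools from exhausting PostgreSQL's per-connection backends.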
1. Provision identical Linux hosts.
2. Install PostgreSQL 16 and the Citus packages via your package manager.
3. Run CREATE EXTENSION citus; on all nodes (see the configuration sketch after these steps).
4. No separate cstore_fdw install is needed on this stack: Citus 10+ ships the columnar access method (cstore_fdw's successor), so USING columnar works once the citus extension is created.
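Citus must be loaded via shared_preload_libraries before the extension can be created. A minimal sketch of that step, assuming default config locations:

-- In postgresql.conf on the coordinator and on every worker, then restart PostgreSQL:
--   shared_preload_libraries = 'citus'
-- Afterwards, connected to the analytics database on each node:
CREATE EXTENSION citus;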
Define distributed and columnar tables that mimic Redshift's DISTKEY and SORTKEY.
-- On coordinator
SELECT master_add_node('worker01', 5432);
SELECT master_add_node('worker02', 5432);
CREATE TABLE customers (
id BIGINT,
name TEXT,
email TEXT,
created_at TIMESTAMPTZ
) USING columnar;
SELECT create_reference_table('customers'); -- replicate small dim table
CREATE TABLE orders (
id BIGINT,
customer_id BIGINT,
order_date DATE,
total_amount NUMERIC(14,2)
) USING columnar;
SELECT create_distributed_table('orders','customer_id');
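As a quick sanity check on the layout: customers is a reference table, so every worker holds a full copy and the join below runs shard-locally. The query is illustrative, using only the columns defined above.

SELECT c.name,
       date_trunc('month', o.order_date) AS month,
       sum(o.total_amount)               AS revenue
FROM orders o
JOIN customers c ON c.id = o.customer_id
GROUP BY 1, 2
ORDER BY revenue DESC
LIMIT 20;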
Bulk-load CSVs via COPY (convert Parquet to CSV first, since COPY reads only text, CSV, or binary); Citus streams the incoming rows to the worker shards in parallel:
COPY orders FROM PROGRAM 'aws s3 cp s3://bucket/orders.csv -'
WITH (FORMAT csv, HEADER true);
1. Co-locate related tables on the same distribution key.
2. Partition large fact tables by day using CREATE TABLE ... PARTITION BY RANGE (order_date), as sketched after this list.
3. Periodically run ANALYZE and VACUUM (FULL, VERBOSE).
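A rough sketch of point 2, using a hypothetical orders_part table (names and the single example partition are placeholders): the parent is partitioned first, then distributed, and each daily partition can itself be columnar.

CREATE TABLE orders_part (
    id           BIGINT,
    customer_id  BIGINT,
    order_date   DATE,
    total_amount NUMERIC(14,2)
) PARTITION BY RANGE (order_date);

SELECT create_distributed_table('orders_part', 'customer_id');

-- One daily partition; in practice a cron job or pg_partman creates these ahead of time.
CREATE TABLE orders_part_2024_06_01 PARTITION OF orders_part
    FOR VALUES FROM ('2024-06-01') TO ('2024-06-02') USING columnar;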
Run nightly VACUUM FULL on the biggest partitions and REFRESH MATERIALIZED VIEW on summary tables.
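A minimal sketch of one such summary table; the daily_sales view and its columns are assumptions, not part of the schema above.

CREATE MATERIALIZED VIEW daily_sales AS
SELECT order_date, count(*) AS orders, sum(total_amount) AS revenue
FROM orders
GROUP BY order_date;

-- A unique index lets the nightly refresh run CONCURRENTLY without blocking readers.
CREATE UNIQUE INDEX ON daily_sales (order_date);
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_sales;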
Add a thin translation layer: create stub functions like GETDATE(), accept NVARCHAR synonyms, and write views mapping Redshift's system tables to pg_catalog.
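A few illustrative stubs; coverage of Redshift's built-ins will necessarily stay incomplete, and the pg_table_def view below is a very rough stand-in for the real system view.

CREATE OR REPLACE FUNCTION getdate() RETURNS timestamptz
LANGUAGE sql STABLE AS $$ SELECT now() $$;

-- PostgreSQL has no NVARCHAR; an unparameterized domain covers the common case,
-- while NVARCHAR(n) still needs rewriting to VARCHAR(n).
CREATE DOMAIN nvarchar AS varchar;

-- Simplified replacement for Redshift's PG_TABLE_DEF (subset of its columns).
CREATE VIEW pg_table_def AS
SELECT table_schema AS schemaname,
       table_name   AS tablename,
       column_name  AS "column",
       data_type    AS type
FROM information_schema.columns;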
• Use RAID-10 NVMe storage on each worker.
• Set work_mem = 64MB and maintenance_work_mem = 2GB (applied with ALTER SYSTEM in the sketch below).
• Set max_parallel_workers_per_gather = 4.
• Automate backups with WAL-G to S3.
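One way to apply those settings consistently, run on the coordinator and on every worker (the values are the starting points above, not tuned figures):

ALTER SYSTEM SET work_mem = '64MB';
ALTER SYSTEM SET maintenance_work_mem = '2GB';
ALTER SYSTEM SET max_parallel_workers_per_gather = 4;
SELECT pg_reload_conf();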
1. Uneven shard distribution: always choose high-cardinality distribution keys.
2. Forgetting replica lag: monitor pg_stat_replication (a sample query follows this list).
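A minimal lag check on the primary, reporting how many bytes of WAL each standby still has to replay:

SELECT client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;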
Add servers, run SELECT master_add_node(), then rebalance_table_shards() during low-traffic windows.
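Concretely, with a hypothetical third worker (calling rebalance_table_shards() without arguments rebalances every co-location group):

SELECT master_add_node('worker03', 5432);  -- citus_add_node() in recent Citus releases
SELECT rebalance_table_shards();           -- moves existing shards onto the new worker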
Is there any licensing cost? No: all core components (PostgreSQL, Citus and its columnar storage, PgBouncer) are open-source. You only pay for hardware or cloud VMs.
Will writes be slower than on row-oriented tables? Slightly. Batch your inserts with COPY or micro-batching to maintain high throughput.
Can you migrate back to Redshift later? Yes. Export tables to Parquet or CSV and use Redshift's COPY FROM S3 to reload.
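An illustrative export of a single table, mirroring the load path above; it needs superuser rights for COPY ... TO PROGRAM plus an AWS CLI configured on the coordinator, and the bucket path is a placeholder.

COPY orders TO PROGRAM 'aws s3 cp - s3://bucket/export/orders.csv'
WITH (FORMAT csv, HEADER true);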