Guidelines for optimizing Amazon Redshift performance, cost, and reliability.
Use star or snowflake schemas to reduce joins. Keep fact tables narrow; move infrequently used columns to dimension tables. Declare data types precisely—numeric(10,2) instead of varchar—so Redshift stores fewer bytes and scans less data.
Pick a DISTKEY used in large joins (e.g., customer_id) to colocate rows on the same node. Choose a SORTKEY that appears in range filters or order-by clauses (e.g., order_date) so Redshift can skip blocks with zone maps.
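As a concrete sketch, here is a hypothetical orders fact table that applies both guidelines; the table and column names are illustrative, not from any particular schema:

    -- Precise types keep rows narrow; the join key is the DISTKEY and the
    -- range-filter column is the SORTKEY (all names are hypothetical).
    CREATE TABLE orders (
        order_id    BIGINT        NOT NULL,
        customer_id BIGINT        NOT NULL,
        order_date  DATE          NOT NULL,
        amount      NUMERIC(10,2) NOT NULL
    )
    DISTKEY (customer_id)
    SORTKEY (order_date);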
Load from Amazon S3 using COPY with COMPUPDATE OFF and STATUPDATE OFF for large batches, so the load skips automatic compression analysis and statistics collection. Compress files as gzip or zstd, split them into evenly sized parts (roughly 1 MB to 1 GB each after compression, with the file count a multiple of the cluster's slice count), and use the MANIFEST option so COPY loads exactly the files you list, exactly once.
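A minimal COPY sketch along these lines, assuming the hypothetical orders table above plus a placeholder bucket and IAM role:

    -- Batch load from S3; bucket, manifest path, and role ARN are placeholders.
    COPY orders
    FROM 's3://example-bucket/orders/batch.manifest'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    MANIFEST
    GZIP
    COMPUPDATE OFF
    STATUPDATE OFF;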
Delete and update operations leave behind deleted blocks. VACUUM reclaims that space and re-sorts rows, while ANALYZE refreshes statistics so the planner chooses optimal join orders. Automate both in nightly schedules.
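For example, a nightly job might run both against the hypothetical orders table:

    VACUUM orders;   -- reclaim deleted blocks and restore sort order
    ANALYZE orders;  -- refresh table statistics for the planner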
Create separate WLM queues for ETL, dashboards, and ad-hoc users, and route queries to them with query_group or user_group rules. Enable concurrency scaling on queues that serve bursty BI loads, and set a timeout (WLM max_execution_time, or statement_timeout per session) to cancel runaway queries.
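At the session level, routing and timeouts look like the following sketch; the matching query_group name 'etl' is an assumption about your WLM configuration:

    -- Route this session's queries to the queue whose WLM rule matches 'etl'.
    SET query_group TO 'etl';
    -- Cancel any statement that runs longer than five minutes (milliseconds).
    SET statement_timeout TO 300000;
    -- ... run ETL statements here ...
    RESET query_group;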
Encrypt clusters at rest with KMS, enforce SSL in JDBC connection strings, rotate IAM keys, and restrict inbound traffic to specific security-group CIDRs. Use role-based access control via GRANT and REVOKE.
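A sketch of group-based privileges, with placeholder group and schema names:

    -- Read-only access for dashboard users (names are illustrative).
    CREATE GROUP dashboard_readers;
    GRANT USAGE ON SCHEMA analytics TO GROUP dashboard_readers;
    GRANT SELECT ON ALL TABLES IN SCHEMA analytics TO GROUP dashboard_readers;
    -- Remove blanket access so privileges must be granted explicitly.
    REVOKE ALL ON SCHEMA analytics FROM PUBLIC;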
Archive cold tables to Amazon S3 and query them through Redshift Spectrum, use RA3 nodes with managed storage, and tune automated snapshot retention. Right-size concurrency scaling limits and pause development clusters overnight.
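Exposing archived data through Spectrum can look like this sketch, where the catalog database name and role ARN are placeholders:

    -- External schema over an S3-backed data catalog database.
    CREATE EXTERNAL SCHEMA spectrum_archive
    FROM DATA CATALOG
    DATABASE 'archive_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
    CREATE EXTERNAL DATABASE IF NOT EXISTS;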
Should small dimension tables use DISTSTYLE ALL? Yes: tables under about 10 GB benefit from DISTSTYLE ALL because every node holds a full copy, eliminating broadcast steps in joins.
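A minimal sketch with a hypothetical region dimension:

    -- Replicated to every node, so joins never need to broadcast it.
    CREATE TABLE dim_region (
        region_id   INTEGER NOT NULL,
        region_name VARCHAR(64)
    )
    DISTSTYLE ALL;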
Can you change a table's DISTKEY after creation? Newer Redshift releases support ALTER TABLE ... ALTER DISTKEY in place; where that is unavailable, create a new table with the desired DISTKEY, run INSERT INTO new SELECT * FROM old, then swap names with ALTER TABLE ... RENAME.
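Both paths, sketched against the hypothetical orders table from earlier:

    -- In place, on clusters that support it:
    ALTER TABLE orders ALTER DISTKEY customer_id;

    -- Deep-copy fallback:
    CREATE TABLE orders_new (
        order_id    BIGINT        NOT NULL,
        customer_id BIGINT        NOT NULL,
        order_date  DATE          NOT NULL,
        amount      NUMERIC(10,2) NOT NULL
    )
    DISTKEY (customer_id);
    INSERT INTO orders_new SELECT * FROM orders;
    ALTER TABLE orders RENAME TO orders_old;
    ALTER TABLE orders_new RENAME TO orders;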
How often should VACUUM run? Run VACUUM FULL weekly for write-heavy tables, or whenever deleted rows exceed 20% of a table. The lightweight VACUUM DELETE ONLY can run daily.
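For the hypothetical orders table, the two variants look like this:

    VACUUM FULL orders;        -- re-sorts rows and reclaims space; heavier
    VACUUM DELETE ONLY orders; -- reclaims deleted blocks without re-sorting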