COPY ingests external files (CSV, JSON, Parquet, etc.) into Redshift tables in parallel for fast, scalable bulk loading.
COPY ingests data files stored outside the cluster and writes them into a target table using Redshift’s massively parallel architecture. It delivers far higher throughput than INSERT, supports automatic compression, and validates each row while loading.
COPY handles CSV, JSON, Parquet, Avro, ORC, and fixed-width text. It also accepts GZIP- or BZIP2-compressed files. Pick the FORMAT option that matches your files.
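For illustration, a Parquet load and a gzipped CSV load differ only in the format clause. The table, bucket, and role ARN below are placeholders:

    -- Columnar source files: Parquet compresses itself, so no GZIP option is needed
    COPY sales
    FROM 's3://my-bucket/parquet/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    FORMAT AS PARQUET;

    -- Delimited text that was gzipped before upload
    COPY sales
    FROM 's3://my-bucket/csv/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV GZIP;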
Place source files in an S3 path that the cluster can reach. Attach an IAM role that grants s3:GetObject (and s3:ListBucket for prefix loads). If you pass key-based credentials in the command instead, store the keys in AWS Secrets Manager rather than hard-coding them.
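Inside the command, the role is referenced by its ARN, or with IAM_ROLE DEFAULT when a default role is attached to the cluster. A minimal sketch with a placeholder path:

    -- Uses the cluster's default IAM role instead of an explicit ARN
    COPY sales FROM 's3://my-bucket/stage/' IAM_ROLE DEFAULT CSV;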
Define the table's columns in the same order as the file, or list the target columns in the COPY command; use IGNOREHEADER 1 to skip a header row of column names.
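A sketch of both approaches, assuming a hypothetical sales table whose CSV file starts with a header row:

    -- Table columns declared in the same order as the file
    CREATE TABLE sales (sale_id INT, sale_date DATE, amount DECIMAL(10,2));

    -- Explicit column list plus IGNOREHEADER to skip the header row
    COPY sales (sale_id, sale_date, amount)
    FROM 's3://my-bucket/csv/sales.csv'
    IAM_ROLE DEFAULT
    CSV IGNOREHEADER 1;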
Specify the S3 URI, IAM role, and CSV settings. Use MAXERROR to control error tolerance, COMPUPDATE ON to apply compression encodings automatically, and STATUPDATE ON to refresh table statistics after the load.
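Putting the pieces together, a full load might look like the sketch below; every name and the role ARN are placeholders:

    COPY sales
    FROM 's3://my-bucket/csv/2024/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
    CSV IGNOREHEADER 1 GZIP
    MAXERROR 10        -- tolerate up to 10 bad rows before the load fails
    COMPUPDATE ON      -- choose compression encodings automatically
    STATUPDATE ON;     -- refresh table statistics after the load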
When the source files are scattered across different prefixes or buckets, generate a JSON manifest listing each object and pass MANIFEST to COPY. Redshift loads only the referenced files.
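A manifest is a small JSON file of the form below (object paths are placeholders), and COPY then points at the manifest object rather than at a prefix:

    {"entries": [
      {"url": "s3://my-bucket/2024/part-0001.csv.gz", "mandatory": true},
      {"url": "s3://other-bucket/extra/part-0002.csv.gz", "mandatory": true}
    ]}

    COPY sales
    FROM 's3://my-bucket/manifests/sales.manifest'
    IAM_ROLE DEFAULT
    CSV GZIP
    MANIFEST;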
Add the JSON parameter pointing to a JSONPaths file or ‘auto’. Redshift flattens the JSON into the table’s columns during load.
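Two common variants, with placeholder paths: 'auto' matches top-level JSON keys to column names, while a JSONPaths file maps fields explicitly.

    -- Let Redshift match top-level JSON keys to column names
    COPY events FROM 's3://my-bucket/events/' IAM_ROLE DEFAULT JSON 'auto';

    -- Or map nested fields explicitly with a JSONPaths file such as
    -- {"jsonpaths": ["$.event_id", "$.user.name", "$.ts"]}
    COPY events FROM 's3://my-bucket/events/' IAM_ROLE DEFAULT
    JSON 's3://my-bucket/jsonpaths/events.json';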
Stage data in equal-sized files (100 MB–1 GB) to maximize parallelism. Compress files to cut transfer time. Grant least-privilege IAM roles and audit completed loads through STL_LOAD_COMMITS.
No. COPY cannot read files from a local machine. Stage the file in S3 first or use the Redshift Data API to insert small data sets.
Query STL_LOAD_COMMITS and STL_LOAD_ERRORS for row counts, error details, and load times. CloudWatch also captures COPY logs.
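A quick check after a load might look like this:

    -- Files committed by recent loads, with rows scanned per file
    SELECT query, TRIM(filename) AS file, lines_scanned, curtime
    FROM stl_load_commits
    ORDER BY curtime DESC
    LIMIT 10;

    -- Rows rejected by recent loads and why
    SELECT query, TRIM(filename) AS file, line_number, colname, TRIM(err_reason) AS reason
    FROM stl_load_errors
    ORDER BY starttime DESC
    LIMIT 10;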
Yes. COPY is atomic: if the number of rejected rows exceeds MAXERROR (0 by default), the entire load rolls back. Raising MAXERROR lets the load tolerate a limited number of bad rows, and ACCEPTINVCHARS replaces invalid characters instead of rejecting the row.