How to Integrate Amazon Redshift with Apache Spark

Galaxy Glossary

How do I integrate Amazon Redshift with Apache Spark?

Read Amazon Redshift tables into Apache Spark DataFrames, and write DataFrames back to Redshift, using the spark-redshift connector and the Redshift JDBC driver.



Why integrate Apache Spark with Amazon Redshift?

Combine Redshift’s managed columnar warehouse with Spark’s in-memory processing. Perform heavy transformations or machine learning in Spark, then persist cleansed, aggregated results back to Redshift for BI users.

What libraries are required?

Add the Amazon Redshift JDBC driver (redshift-jdbc42) and the spark-redshift connector JARs to your Spark cluster. On EMR or Glue, specify them in the bootstrap action or job parameters; for a standalone session you can pull them from Maven at startup, as sketched below.
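A minimal sketch of resolving both dependencies when the session starts. The version numbers are illustrative; pin releases that match your Spark and Scala build.

from pyspark.sql import SparkSession

# Versions below are illustrative; match them to your Spark/Scala build.
spark = (SparkSession.builder
  .appName("redshift-spark-integration")
  .config("spark.jars.packages",
          "io.github.spark-redshift-community:spark-redshift_2.12:6.2.0,"
          "com.amazon.redshift:redshift-jdbc42:2.1.0.29")
  .getOrCreate())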

How do I read a Redshift table into a DataFrame?

Configure the DataFrame reader with format("io.github.spark_redshift_community.spark.redshift") (Databricks runtimes ship a built-in "redshift" short name), then supply the JDBC URL, an IAM role ARN, the target dbtable, and an S3 tempdir for staging.

How do I write a DataFrame back to Redshift?

Call df.write with the same options plus mode("append") or mode("overwrite"). The connector stages the DataFrame as files in S3 (Avro or CSV by default; Parquet in newer connector versions via tempformat) and issues a Redshift COPY.

Which IAM permissions are needed?

There are no COPY- or UNLOAD-specific IAM actions. The role passed as aws_iam_role (and associated with the Redshift cluster) needs S3 read/write access on the tempdir bucket path: s3:GetObject, s3:PutObject, s3:DeleteObject, and s3:ListBucket. The Spark job's own AWS credentials need the same access so it can stage and read files there.
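If you manage the role in code, here is a minimal boto3 sketch of that inline policy; the role, bucket, and policy names are placeholders.

import json
import boto3

# Grant the staging-bucket access that Redshift COPY/UNLOAD and Spark both need.
# All names below are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject",
                   "s3:DeleteObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-bucket",
            "arn:aws:s3:::my-bucket/redshift_temp/*",
        ],
    }],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="RedshiftCopyUnload",
    PolicyName="SparkRedshiftTempdir",
    PolicyDocument=json.dumps(policy),
)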

How can I improve performance?

Stage data in Parquet where your connector version supports it, define DISTKEY and SORTKEY on large target tables before loading, and size the Redshift cluster (or enable concurrency scaling on RA3 nodes) to match data volume.
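A sketch using the table-layout and staging options documented for spark-redshift (distkey, sortkeyspec, tempformat); option support varies by connector version, and all names are placeholders.

# Create the target table with a distribution and sort key, staging as Parquet.
# Check your connector release notes for tempformat support.
(df.write.format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "daily_order_margin")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .option("distkey", "order_date")
  .option("sortkeyspec", "SORTKEY(order_date)")
  .option("tempformat", "PARQUET")
  .mode("overwrite")
  .save())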

Can filters be pushed down?

Yes. When you apply .filter() (or select a subset of columns) on the loaded DataFrame, the connector embeds the supported predicates and projections in the SELECT it sends to Redshift, reducing the data unloaded and scanned.
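For example, reusing the read options from the example below; predicates the connector cannot translate simply run in Spark instead.

orders_df = (spark.read.format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "Orders")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .load())

# Only the filtered, projected rows are unloaded from Redshift.
recent = (orders_df
  .filter("order_date >= '2024-01-01'")
  .select("id", "product_id", "quantity"))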

Example: enrich daily order margins

Read Orders and Products, join, calculate margin, and write results to daily_order_margin in Redshift for downstream dashboards.


How to Integrate Amazon Redshift with Apache Spark Example Usage


# Enrich Orders with product margins: read both tables, join, write back
orders_df = (spark.read.format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "Orders")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .load())

products_df = (spark.read.format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "Products")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .load())

# Join on the product id and compute the per-line margin
margin_df = (orders_df.join(products_df, orders_df.product_id == products_df.id)
  .withColumn("margin", (products_df.price - products_df.cost) * orders_df.quantity))

# Stage to S3 and COPY into Redshift
(margin_df.write.format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "daily_order_margin")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .mode("append")
  .save())

How to Integrate Amazon Redshift with Apache Spark Syntax


spark.read \
  .format("io.github.spark_redshift") \
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev?user=awsuser&password=***") \
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload") \
  .option("dbtable", "Orders") \
  .option("tempdir", "s3://my-bucket/redshift_temp/") \
  .load()

# Write back
(df_result
  .write
  .format("io.github.spark_redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "daily_order_margin")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .mode("overwrite")
  .save())


Frequently Asked Questions (FAQs)

Can I use AWS Glue instead of vanilla Spark?

Yes. Glue 4.x ships with the Redshift connector. Provide identical options in the Glue job script.
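A minimal sketch of such a Glue job script, assuming the Glue 4.x runtime resolves the same connector format name; connection values are placeholders.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session  # reuse the Glue-managed SparkSession

orders_df = (spark.read.format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "Orders")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .load())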

Does the connector support Redshift Serverless?

Serverless endpoints work the same; change the JDBC host to the serverless endpoint and keep IAM permissions identical.

How do I load only recent data?

Add .option("query", "SELECT * FROM Orders WHERE order_date >= current_date - 7") instead of dbtable to pull a filtered result set.
