How to Integrate Amazon Redshift with Apache Spark

Galaxy Glossary

How do I integrate Amazon Redshift with Apache Spark?

Read Amazon Redshift tables into Apache Spark DataFrames, and write DataFrames back to Redshift, using the spark-redshift connector and the Redshift JDBC driver.



Why integrate Apache Spark with Amazon Redshift?

Combine Redshift’s managed columnar warehouse with Spark’s in-memory processing. Perform heavy transformations or machine learning in Spark, then persist cleansed, aggregated results back to Redshift for BI users.

What libraries are required?

Add the Amazon Redshift JDBC driver (redshift-jdbc42) and the spark-redshift connector JARs to your Spark cluster. On EMR or Glue, specify them in the bootstrap action or job parameters; for a standalone session you can pull them from Maven at startup, as sketched below.
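A minimal sketch of resolving both dependencies when the session starts. The version numbers are illustrative; pin releases that match your Spark and Scala build.

from pyspark.sql import SparkSession

# Versions below are illustrative; match them to your Spark/Scala build.
spark = (SparkSession.builder
  .appName("redshift-spark-integration")
  .config("spark.jars.packages",
          "io.github.spark-redshift-community:spark-redshift_2.12:6.2.0,"
          "com.amazon.redshift:redshift-jdbc42:2.1.0.29")
  .getOrCreate())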

How do I read a Redshift table into a DataFrame?

Configure the DataFrame reader with format("io.github.spark_redshift_community.spark.redshift") (Databricks runtimes ship a built-in "redshift" short name), then supply the JDBC URL, an IAM role ARN, the target dbtable, and an S3 tempdir for staging.

How do I write a DataFrame back to Redshift?

Call df.write with the same options plus mode("append") or mode("overwrite"). The connector stages the DataFrame as files in S3 (Avro or CSV by default; Parquet in newer connector versions via tempformat) and issues a Redshift COPY.

Which IAM permissions are needed?

There are no COPY- or UNLOAD-specific IAM actions. The role passed as aws_iam_role (and associated with the Redshift cluster) needs S3 read/write access on the tempdir bucket path: s3:GetObject, s3:PutObject, s3:DeleteObject, and s3:ListBucket. The Spark job's own AWS credentials need the same access so it can stage and read files there.
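If you manage the role in code, here is a minimal boto3 sketch of that inline policy; the role, bucket, and policy names are placeholders.

import json
import boto3

# Grant the staging-bucket access that Redshift COPY/UNLOAD and Spark both need.
# All names below are placeholders.
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject",
                   "s3:DeleteObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-bucket",
            "arn:aws:s3:::my-bucket/redshift_temp/*",
        ],
    }],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="RedshiftCopyUnload",
    PolicyName="SparkRedshiftTempdir",
    PolicyDocument=json.dumps(policy),
)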

How can I improve performance?

Stage data in Parquet where your connector version supports it, define DISTKEY and SORTKEY on large target tables before loading, and size the Redshift cluster (or enable concurrency scaling on RA3 nodes) to match data volume.
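A sketch using the table-layout and staging options documented for spark-redshift (distkey, sortkeyspec, tempformat); option support varies by connector version, and all names are placeholders.

# Create the target table with a distribution and sort key, staging as Parquet.
# Check your connector release notes for tempformat support.
(df.write.format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "daily_order_margin")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .option("distkey", "order_date")
  .option("sortkeyspec", "SORTKEY(order_date)")
  .option("tempformat", "PARQUET")
  .mode("overwrite")
  .save())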

Can filters be pushed down?

Yes. When you apply .filter() (or select a subset of columns) on the loaded DataFrame, the connector embeds the supported predicates and projections in the SELECT it sends to Redshift, reducing the data unloaded and scanned.
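For example, reusing the read options from the example below; predicates the connector cannot translate simply run in Spark instead.

orders_df = (spark.read.format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "Orders")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .load())

# Only the filtered, projected rows are unloaded from Redshift.
recent = (orders_df
  .filter("order_date >= '2024-01-01'")
  .select("id", "product_id", "quantity"))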

Example: enrich daily order margins

Read Orders and Products, join, calculate margin, and write results to daily_order_margin in Redshift for downstream dashboards.


How to Integrate Amazon Redshift with Apache Spark Example Usage


# Enrich Orders with product margins: read both tables, join, write back
orders_df = (spark.read.format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "Orders")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .load())

products_df = (spark.read.format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "Products")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .load())

# Join on the product id and compute the per-line margin
margin_df = (orders_df.join(products_df, orders_df.product_id == products_df.id)
  .withColumn("margin", (products_df.price - products_df.cost) * orders_df.quantity))

# Stage to S3 and COPY into Redshift
(margin_df.write.format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "daily_order_margin")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .mode("append")
  .save())

How to Integrate Amazon Redshift with Apache Spark Syntax


spark.read \
  .format("io.github.spark_redshift") \
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev?user=awsuser&password=***") \
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload") \
  .option("dbtable", "Orders") \
  .option("tempdir", "s3://my-bucket/redshift_temp/") \
  .load()

# Write back
(df_result
  .write
  .format("io.github.spark_redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "daily_order_margin")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .mode("overwrite")
  .save())


Frequently Asked Questions (FAQs)

Can I use AWS Glue instead of vanilla Spark?

Yes. Glue 4.x ships with the Redshift connector. Provide identical options in the Glue job script.
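A minimal sketch of such a Glue job script, assuming the Glue 4.x runtime resolves the same connector format name; connection values are placeholders.

from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session  # reuse the Glue-managed SparkSession

orders_df = (spark.read.format("io.github.spark_redshift_community.spark.redshift")
  .option("url", "jdbc:redshift://redshift-cluster:5439/dev")
  .option("aws_iam_role", "arn:aws:iam::123456789012:role/RedshiftCopyUnload")
  .option("dbtable", "Orders")
  .option("tempdir", "s3://my-bucket/redshift_temp/")
  .load())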

Does the connector support Redshift Serverless?

Serverless endpoints work the same; change the JDBC host to the serverless endpoint and keep IAM permissions identical.

How do I load only recent data?

Add .option("query", "SELECT * FROM Orders WHERE order_date >= current_date - 7") instead of dbtable to pull a filtered result set.
