Read Amazon Redshift tables into Apache Spark DataFrames, and write DataFrames back to Redshift, through the spark-redshift connector and the Redshift JDBC driver.
Combine Redshift’s managed columnar warehouse with Spark’s in-memory processing. Perform heavy transformations or machine learning in Spark, then persist cleansed, aggregated results back to Redshift for BI users.
Add the Redshift JDBC driver (`redshift-jdbc42`) and the `spark-redshift` (or `redshift`) DataSource JARs to your Spark cluster. On EMR or Glue, specify them in the bootstrap action or job parameters.
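If you build the Spark session yourself, one way to attach the connector is via Maven coordinates. A minimal sketch, assuming the community connector artifacts; the version strings are placeholders you would pin against your Spark and Scala versions:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("redshift-io")
    # Hypothetical coordinates/versions -- pin them to match your cluster.
    .config(
        "spark.jars.packages",
        "io.github.spark-redshift-community:spark-redshift_2.12:6.2.0-spark_3.5,"
        "com.amazon.redshift:redshift-jdbc42:2.1.0.30",
    )
    .getOrCreate()
)
```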
Configure the DataFrame reader with `format("io.github.spark_redshift_community.spark.redshift")` (or the short name `"redshift"` where the runtime registers it), then supply the JDBC URL, the IAM role ARN, the target `dbtable`, and an S3 `tempdir`.
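A minimal read sketch; the cluster endpoint, role ARN, bucket, and table name are placeholders:

```python
orders = (
    spark.read
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-spark-role")
    .option("dbtable", "public.orders")
    .option("tempdir", "s3a://my-bucket/redshift-temp/")  # staging area for UNLOAD/COPY
    .load()
)
```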
Call `df.write` with identical options plus `.mode("append")` or `.mode("overwrite")`. The connector stages Parquet/CSV files in S3 and issues a Redshift `COPY`.
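Continuing from the read sketch above, a hedged write example; the target table and paths remain placeholders:

```python
(
    orders.write
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-spark-role")
    .option("dbtable", "public.orders_copy")
    .option("tempdir", "s3a://my-bucket/redshift-temp/")
    .mode("append")  # or "overwrite" to replace the table contents
    .save()
)
```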
The IAM role must let Redshift `COPY` from and `UNLOAD` to the staging location, which in practice means S3 read/write (for example `s3:GetObject`, `s3:PutObject`, `s3:ListBucket`, `s3:DeleteObject`) on the specified bucket path.
Stage data in Parquet, set `DISTSTYLE KEY` / `SORTKEY` on large tables before loading, and size the Redshift cluster (or RA3 concurrency-scaling capacity) to match the data volume.
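A sketch of declaring distribution and sort keys at write time, assuming the connector's documented `tempformat`, `diststyle`, `distkey`, and `sortkeyspec` options are available in your version (they apply when the connector creates the table; column names are hypothetical):

```python
(
    orders.write
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-spark-role")
    .option("dbtable", "public.orders_large")
    .option("tempdir", "s3a://my-bucket/redshift-temp/")
    .option("tempformat", "PARQUET")               # stage as Parquet rather than CSV
    .option("diststyle", "KEY")
    .option("distkey", "order_id")                 # hypothetical distribution column
    .option("sortkeyspec", "SORTKEY(order_date)")  # hypothetical sort key
    .mode("overwrite")
    .save()
)
```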
Yes. When you chain `.filter()` onto the loaded DataFrame, the connector pushes your predicates down into the generated `SELECT`, reducing the data scanned in Redshift.
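Building on the `orders` DataFrame from the read sketch, a quick way to confirm pushdown is to inspect the physical plan:

```python
from pyspark.sql import functions as F

recent_orders = orders.filter(F.col("order_date") >= F.date_sub(F.current_date(), 7))
recent_orders.explain()  # look for the pushed filters in the scan node of the plan
```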
Read `Orders` and `Products`, join them, calculate margin, and write the results to `daily_order_margin` in Redshift for downstream dashboards.
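An end-to-end sketch of that pipeline; the connection constants, column names, and margin formula are all illustrative assumptions:

```python
from pyspark.sql import functions as F

# Placeholder connection settings.
REDSHIFT_URL = "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev"
IAM_ROLE_ARN = "arn:aws:iam::123456789012:role/redshift-spark-role"
TEMP_DIR = "s3a://my-bucket/redshift-temp/"

def read_table(name):
    """Read one Redshift table into a DataFrame."""
    return (
        spark.read
        .format("io.github.spark_redshift_community.spark.redshift")
        .option("url", REDSHIFT_URL)
        .option("aws_iam_role", IAM_ROLE_ARN)
        .option("tempdir", TEMP_DIR)
        .option("dbtable", name)
        .load()
    )

orders = read_table("public.orders")
products = read_table("public.products")

# Hypothetical schema: orders(product_id, order_ts, sale_price), products(product_id, unit_cost).
daily_margin = (
    orders.join(products, "product_id")
    .withColumn("margin", F.col("sale_price") - F.col("unit_cost"))
    .groupBy(F.to_date("order_ts").alias("order_date"))
    .agg(F.sum("margin").alias("total_margin"))
)

(
    daily_margin.write
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", REDSHIFT_URL)
    .option("aws_iam_role", IAM_ROLE_ARN)
    .option("tempdir", TEMP_DIR)
    .option("dbtable", "public.daily_order_margin")
    .mode("overwrite")
    .save()
)
```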
Yes. Glue 4.x ships with the Redshift connector. Provide identical options in the Glue job script.
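A minimal Glue job skeleton showing where the same reader code sits; this assumes the connector JARs (or Glue's bundled Redshift support) are available to the job, and the endpoint, role, and paths are placeholders:

```python
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Same options as on EMR.
orders = (
    spark.read
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-spark-role")
    .option("dbtable", "public.orders")
    .option("tempdir", "s3a://my-bucket/redshift-temp/")
    .load()
)

job.commit()
```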
Serverless endpoints work the same; change the JDBC host to the serverless endpoint and keep IAM permissions identical.
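Only the host portion of the JDBC URL changes for Redshift Serverless; the workgroup, account ID, and region below are illustrative placeholders:

```python
# Provisioned cluster endpoint:
#   jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev
# Serverless workgroup endpoint (illustrative):
serverless_url = (
    "jdbc:redshift://my-workgroup.123456789012"
    ".us-east-1.redshift-serverless.amazonaws.com:5439/dev"
)
# Pass it exactly as before: .option("url", serverless_url)
```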
Add `.option("query", "SELECT * FROM Orders WHERE order_date >= current_date - 7")` instead of `dbtable` to pull a filtered result set.
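A sketch of the `query` option in place of `dbtable` (the two are mutually exclusive); connection details are placeholders as before:

```python
recent = (
    spark.read
    .format("io.github.spark_redshift_community.spark.redshift")
    .option("url", "jdbc:redshift://my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/dev")
    .option("aws_iam_role", "arn:aws:iam::123456789012:role/redshift-spark-role")
    .option("query", "SELECT * FROM orders WHERE order_date >= current_date - 7")
    .option("tempdir", "s3a://my-bucket/redshift-temp/")
    .load()
)
```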