Use Spark’s JDBC data source to read from and write to MySQL tables at scale.
Pulling data from MySQL into Spark lets you run distributed analytics, join with other data sources, and write the results back to MySQL for downstream apps, dashboards, or micro-services.
Use a jdbc:mysql:// URL, a MySQL user with SELECT/INSERT privileges, and the com.mysql.cj.jdbc.Driver driver class. Adding rewriteBatchedStatements=true to the URL speeds up batched writes.
import os

df = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db.prod:3306/shop")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "Customers")
    .option("user", "analytics")
    .option("password", os.environ["MYSQL_PWD"])  # read the secret from the environment
    .load())
Load each table as a DataFrame, register temp views, then run Spark SQL: SELECT o.id, SUM(oi.quantity) qty FROM Orders o JOIN OrderItems oi ON o.id = oi.order_id GROUP BY o.id.
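A minimal sketch of that step, assuming Orders and OrderItems were loaded into orders and order_items DataFrames the same way Customers was loaded above:

orders.createOrReplaceTempView("Orders")
order_items.createOrReplaceTempView("OrderItems")

qty_per_order = spark.sql("""
    SELECT o.id, SUM(oi.quantity) AS qty
    FROM Orders o
    JOIN OrderItems oi ON o.id = oi.order_id
    GROUP BY o.id
""")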
Call df.write with mode("append") or mode("overwrite"), set batchsize (500-5,000), and set isolationLevel="NONE" when eventual consistency is acceptable.
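A sketch of such a write, continuing from the aggregation above; the target table name is an assumption:

(qty_per_order.write.format("jdbc")
    .option("url", "jdbc:mysql://db.prod:3306/shop?rewriteBatchedStatements=true")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "order_quantities")  # hypothetical results table
    .option("user", "analytics")
    .option("password", os.environ["MYSQL_PWD"])
    .option("batchsize", 1000)              # batch inserts instead of row-by-row round-trips
    .option("isolationLevel", "NONE")       # only when eventual consistency is acceptable
    .mode("append")
    .save())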
Filter on created_at > '{{last_run}}' for incremental loads, or use the partitionColumn, lowerBound, upperBound, and numPartitions options to parallel-scan a numeric primary key. Predicate pushdown is enabled by default, so simple filters are evaluated in MySQL rather than in Spark.
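For example, a partitioned scan over a numeric id primary key might look like the following; the bounds and partition count are illustrative assumptions:

customers = (spark.read.format("jdbc")
    .option("url", "jdbc:mysql://db.prod:3306/shop")
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "Customers")
    .option("user", "analytics")
    .option("password", os.environ["MYSQL_PWD"])
    .option("partitionColumn", "id")   # numeric primary key
    .option("lowerBound", 1)
    .option("upperBound", 1_000_000)   # in practice, query MIN(id)/MAX(id) first
    .option("numPartitions", 8)        # eight concurrent JDBC scans
    .load())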
Without mysql-connector-j.jar on Spark's classpath, the job fails. Ship it with --jars or place it in $SPARK_HOME/jars.
A batchsize of 1 turns every inserted row into its own round-trip to MySQL. Set .option("batchsize", 1000) to cut runtime dramatically.
Simple =, <, and > filters on non-computed columns are translated into the JDBC query sent to MySQL, reducing the data transferred to Spark.
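One way to check, reusing the Customers DataFrame from above and assuming id is a numeric column:

from pyspark.sql import functions as F

big_ids = df.filter(F.col("id") > 1000)
big_ids.explain()  # the JDBC scan in the plan lists the pushed predicate under PushedFilters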
Append ?useSSL=true&requireSSL=true to the JDBC URL and provide the keystore/truststore paths with spark.driver.extraJavaOptions (and spark.executor.extraJavaOptions, since executors also open JDBC connections).
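A sketch of those settings; the truststore path and password are placeholders:

from pyspark.sql import SparkSession

ssl_opts = ("-Djavax.net.ssl.trustStore=/etc/spark/certs/truststore.jks "
            "-Djavax.net.ssl.trustStorePassword=changeit")

spark = (SparkSession.builder
    .config("spark.driver.extraJavaOptions", ssl_opts)
    .config("spark.executor.extraJavaOptions", ssl_opts)
    .getOrCreate())

ssl_url = "jdbc:mysql://db.prod:3306/shop?useSSL=true&requireSSL=true"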