MariaDB–Spark integration lets Spark read from and write to MariaDB through the JDBC driver for large-scale analytics.
Spark layers distributed processing on top of MariaDB's transactional data: you can run heavy joins, aggregates, and ML pipelines without exporting CSVs or overloading the OLTP server.
Add the MariaDB JDBC JAR (e.g., org.mariadb.jdbc:mariadb-java-client:3.2.0). Spark shell example: spark-shell --packages org.mariadb.jdbc:mariadb-java-client:3.2.0.
Format: jdbc:mariadb://HOST:PORT/DATABASE?user=USER&password=PW&ssl=true. Optional parameters include serverTimezone, socketTimeout, and permitMysqlScheme.
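For example, a URL that enables TLS and sets a socket timeout might look like this (host and values are illustrative; credentials are usually passed as separate Spark options rather than embedded in the URL):

val url = "jdbc:mariadb://db.example.com:3306/shop?ssl=true&socketTimeout=60000"  // timeout in ms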
Call spark.read.format("jdbc") with url, dbtable, user, and password. Add partitionColumn, lowerBound, upperBound, and numPartitions for parallel reads.
val customers = spark.read.format("jdbc")
  .option("url", "jdbc:mariadb://db:3306/shop")
  .option("dbtable", "Customers")
  .option("user", "analytics")
  .option("password", sys.env("SHOP_PW"))  // read the secret from the environment
  .option("partitionColumn", "id")         // split the scan on the numeric id column
  .option("lowerBound", "1")
  .option("upperBound", "500000")
  .option("numPartitions", "8")            // 8 parallel range queries
  .load()
Create DataFrames for Orders, OrderItems, and Products, register temp views (a sketch follows), then use Spark SQL.
spark.sql("""
SELECT p.name, DATE(o.order_date) AS order_day,
SUM(oi.quantity * p.price) AS revenue
FROM orders o
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON p.id = oi.product_id
GROUP BY p.name, order_day
ORDER BY revenue DESC
""").show()
Use df.write.mode("append").format("jdbc").options(jdbcOpts).save(). Provide createTableColumnTypes if Spark must create the table.
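A sketch of an append-mode write, assuming the revenue query above was captured in a dailyRevenue DataFrame; the jdbcOpts map, target table, and column types are illustrative:

// Shared connection options for the write (illustrative)
val jdbcOpts = Map(
  "url"      -> "jdbc:mariadb://db:3306/shop",
  "user"     -> "analytics",
  "password" -> sys.env("SHOP_PW"),
  "dbtable"  -> "daily_revenue"
)

dailyRevenue.write
  .mode("append")  // add rows to an existing table; "overwrite" replaces it
  .format("jdbc")
  .options(jdbcOpts)
  // Only consulted if Spark has to create the table itself:
  .option("createTableColumnTypes", "name VARCHAR(255), order_day DATE, revenue DECIMAL(12,2)")
  .save()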
The MariaDB user needs SELECT on source tables and INSERT/UPDATE on targets. Restrict the account to analytics hosts for security.
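On the MariaDB side, the grants might look like this (user, host pattern, schema, and target table are illustrative):

-- Read-only access to the source schema, writes only on the analytics target
CREATE USER 'analytics'@'10.0.%' IDENTIFIED BY 'change_me';  -- placeholder password
GRANT SELECT ON shop.* TO 'analytics'@'10.0.%';
GRANT INSERT, UPDATE ON shop.daily_revenue TO 'analytics'@'10.0.%';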
Enable SSL, cap the fetch size (fetchsize=1000), and isolate analytics traffic on read replicas. Monitor Spark executors so long-running queries don't hold locks.
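Applied to a reader, the fetch-size cap looks like this (replica host and value are illustrative):

val orders = spark.read.format("jdbc")
  .option("url", "jdbc:mariadb://replica:3306/shop?ssl=true")
  .option("dbtable", "Orders")
  .option("user", "analytics")
  .option("password", sys.env("SHOP_PW"))
  .option("fetchsize", "1000")  // rows per round trip; tune against memory and latency
  .load()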
Single-partition scans slow jobs: without partitionColumn and its bounds, Spark pulls the whole table through one connection in one task. Always set the partitioning options.
A missing driver JAR causes a ClassNotFoundException. Confirm the JAR is on every Spark node.
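If the JAR is present but the driver still fails to load, pinning the class explicitly can help; org.mariadb.jdbc.Driver is the Connector/J driver class:

val customersPinned = spark.read.format("jdbc")
  .option("url", "jdbc:mariadb://db:3306/shop")
  .option("driver", "org.mariadb.jdbc.Driver")  // force Spark to use this driver class
  .option("dbtable", "Customers")
  .option("user", "analytics")
  .option("password", sys.env("SHOP_PW"))
  .load()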
Check the Apache Spark JDBC docs and the MariaDB Connector/J guide for advanced parameters like allowPublicKeyRetrieval and failover hosts.
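As a rough example of the failover side, a sequential-failover URL looks like this (hosts are illustrative; confirm the exact modes your Connector/J version supports):

val haUrl = "jdbc:mariadb:sequential://db-primary:3306,db-standby:3306/shop"  // tries hosts in order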
Using a read-only replica prevents heavy analytic queries from blocking OLTP traffic. Point the JDBC URL to the replica host when possible.
Spark's JDBC connector pushes simple WHERE clauses and column projections down to MariaDB, reducing data transfer; complex expressions run in Spark.
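To see pushdown in action with the customers DataFrame from the read example, filter and project, then inspect the physical plan:

import org.apache.spark.sql.functions.col

// The id predicate and two-column projection are pushed into the MariaDB query;
// look for PushedFilters and the pruned ReadSchema in the plan output
val recent = customers.filter(col("id") > 100000).select("id", "name")
recent.explain()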
Add ?ssl=true&trustServerCertificate=true to the JDBC URL, or provide a truststore via -Djavax.net.ssl.trustStore.
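For the truststore route, the JVM flag has to reach both the driver and the executors; with spark-shell that might look like this (paths are illustrative):

spark-shell --packages org.mariadb.jdbc:mariadb-java-client:3.2.0 \
  --conf "spark.driver.extraJavaOptions=-Djavax.net.ssl.trustStore=/etc/spark/truststore.jks" \
  --conf "spark.executor.extraJavaOptions=-Djavax.net.ssl.trustStore=/etc/spark/truststore.jks"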