Use Spark’s JDBC data source to read from and write to SQL Server tables directly from Spark jobs.
Load the Microsoft SQL Server JDBC driver on every Spark executor, then call DataFrameReader.format("jdbc") with the right connection options. Spark pushes simple filters (and, in newer versions, limits) down to SQL Server, and with partition options configured it reads in parallel.
Mandatory keys are url, dbtable, user, and password. Optional keys such as driver, fetchsize, and partitionColumn tune performance.
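For reference, a minimal read sketch using those keys; the host, database, table, and credentials are placeholders, and the password comes from an environment variable rather than being hard-coded:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlServerRead").getOrCreate()

// url, dbtable, user, and password are mandatory; driver and fetchsize are optional tuning keys.
val customers = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "[dbo].[Customers]")
  .option("user", "spark_reader")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("fetchsize", "1000") // rows fetched per network round trip
  .load()

customers.printSchema()
```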
Set partitionColumn, lowerBound, upperBound, and numPartitions. Spark generates one range predicate per partition, letting SQL Server serve the ranges concurrently and spreading the transfer across executors.
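A sketch of such a partitioned read, assuming an integer key column named OrderID and illustrative bound values; it reuses the spark session and placeholder connection details from the sketch above:

```scala
// Spark issues numPartitions range queries over OrderID and runs them concurrently.
// The bounds only define the stride; rows outside the range still land in the
// first and last partitions.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "[dbo].[Orders]")
  .option("user", "spark_reader")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .option("partitionColumn", "OrderID") // must be a numeric, date, or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")         // eight concurrent connections
  .load()
```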
Yes. Use DataFrameWriter.mode("append") or mode("overwrite"). For faster bulk loads, enable bulk copy in the Microsoft SQL Server Spark connector (options such as bulkCopyBatchSize and tableLock), or write to a staging table and run MERGE.
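A minimal write sketch over plain JDBC (no bulk copy connector); the staging-table name, credentials, and toy data are assumptions, and the spark session comes from the earlier sketch:

```scala
import spark.implicits._

// Toy result set; in a real job this comes from your transformations.
val totals = Seq((1, 250.0), (2, 99.5)).toDF("CustomerID", "TotalAmount")

totals.write
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "[dbo].[CustomerTotals_Staging]")
  .option("user", "spark_writer")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .mode("append")   // or "overwrite" to replace the table contents
  .save()
// A MERGE from the staging table into the live table can then run on SQL Server.
```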
Read Customers, join to Orders, aggregate totals in Spark, then write the results to a new table, CustomerTotals. The example below shows every step.
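A sketch of that pipeline, assuming column names CustomerID, CustomerName, OrderAmount, and OrderID, plus placeholder connection details:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("CustomerTotals").getOrCreate()

// Shared connection options; server, database, and credentials are placeholders.
val jdbcOpts = Map(
  "url"      -> "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB",
  "user"     -> "spark_etl",
  "password" -> sys.env("SQLSERVER_PASSWORD"),
  "driver"   -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"
)

// 1) Read both source tables; Orders uses a partitioned read on its integer key.
val customers = spark.read.format("jdbc").options(jdbcOpts)
  .option("dbtable", "[dbo].[Customers]")
  .load()

val orders = spark.read.format("jdbc").options(jdbcOpts)
  .option("dbtable", "[dbo].[Orders]")
  .option("partitionColumn", "OrderID")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()

// 2) Join and aggregate in Spark.
val totals = customers.join(orders, "CustomerID")
  .groupBy("CustomerID", "CustomerName")
  .agg(sum("OrderAmount").alias("TotalAmount"))

// 3) Write the result to a new table.
totals.write.format("jdbc").options(jdbcOpts)
  .option("dbtable", "[dbo].[CustomerTotals]")
  .mode("overwrite")
  .save()
```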
1) Place the JDBC driver on spark.executor.extraClassPath. 2) Use Kerberos or Azure AD tokens instead of passwords. 3) Cache small lookup tables. 4) Use correct data types: map decimal to DecimalType with explicit precision and scale.
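For point 4, one way to pin the mapping is the JDBC reader's customSchema option; the table and column definitions below are illustrative:

```scala
// customSchema overrides the Spark-side types for the listed columns,
// here forcing Amount to DecimalType(18,2).
val invoices = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "[dbo].[Invoices]")
  .option("user", "spark_reader")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .option("customSchema", "InvoiceID INT, Amount DECIMAL(18, 2)")
  .load()

invoices.printSchema()
```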
Leaving out partition options forces a single-threaded read, causing long runtimes. Choose an integer key column (a date or timestamp column also works) and set the partition options.
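If good bounds are not known up front, one way to derive them is a quick min/max query against the table; the OrderID key column and connection details are assumptions:

```scala
// Read a one-row derived table containing the min and max key values.
val bounds = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "(SELECT MIN(OrderID) AS lo, MAX(OrderID) AS hi FROM dbo.Orders) AS b")
  .option("user", "spark_reader")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .load()
  .first()

val lower = bounds.getAs[Int]("lo")
val upper = bounds.getAs[Int]("hi")
// Feed lower and upper into lowerBound and upperBound for the partitioned read.
```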
SQL Server requires [schema].[table] or escaped spaces. Always include the schema in dbtable and wrap names in square brackets.
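For instance, a table whose name contains a space (the name is illustrative):

```scala
val details = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "[dbo].[Order Details]") // schema plus bracket-quoted table name
  .option("user", "spark_reader")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .load()
```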
Yes. Spark pushes simple predicates (equality, range) so SQL Server filters rows before transmitting them.
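You can verify this with explain(), which lists pushed predicates under PushedFilters in the physical plan; the Country column here is an assumption:

```scala
val us = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "[dbo].[Customers]")
  .option("user", "spark_reader")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .load()
  .filter("Country = 'US'") // simple equality predicate, evaluated by SQL Server

us.explain() // the physical plan shows the predicate under PushedFilters
```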
Use the JDBC connection string option ;integratedSecurity=true and distribute mssql-jdbc_auth-*.dll to every executor.
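A sketch of such a connection; there are no user or password options, but the authentication DLL must be loadable on every executor:

```scala
val df = spark.read
  .format("jdbc")
  .option("url",
    "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB;integratedSecurity=true")
  .option("dbtable", "[dbo].[Customers]")
  .load()
```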
Bundle the exact Microsoft JDBC driver version that matches your SQL Server with --jars during spark-submit.
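An illustrative spark-submit invocation; the jar paths, driver version, and class name are placeholders, so match the mssql-jdbc version to your SQL Server and JRE:

```bash
spark-submit \
  --class com.example.CustomerTotals \
  --jars /opt/drivers/mssql-jdbc-12.6.1.jre11.jar \
  --conf spark.executor.extraClassPath=/opt/drivers/mssql-jdbc-12.6.1.jre11.jar \
  customer-totals.jar
```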