Use Spark’s JDBC data source to read from and write to SQL Server tables directly from Spark jobs.
Load the Microsoft SQL Server JDBC driver on every Spark executor, then call DataFrameReader.format("jdbc") with the right connection options. Spark pushes simple filters (and, in newer versions, limits) down to SQL Server, and with partition options configured it reads in parallel.
Mandatory keys are url, dbtable, user, and password. Optional keys such as driver, fetchsize, and partitionColumn tune performance.
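For reference, a minimal read sketch using those keys; the host, database, table, and credentials are placeholders, and the password comes from an environment variable rather than being hard-coded:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SqlServerRead").getOrCreate()

// url, dbtable, user, and password are mandatory; driver and fetchsize are optional tuning keys.
val customers = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "[dbo].[Customers]")
  .option("user", "spark_reader")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
  .option("fetchsize", "1000") // rows fetched per network round trip
  .load()

customers.printSchema()
```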
Set partitionColumn, lowerBound, upperBound, and numPartitions. Spark generates one range predicate per partition, letting SQL Server serve the ranges concurrently and spreading the transfer across executors.
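A sketch of such a partitioned read, assuming an integer key column named OrderID and illustrative bound values; it reuses the spark session and placeholder connection details from the sketch above:

```scala
// Spark issues numPartitions range queries over OrderID and runs them concurrently.
// The bounds only define the stride; rows outside the range still land in the
// first and last partitions.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "[dbo].[Orders]")
  .option("user", "spark_reader")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .option("partitionColumn", "OrderID") // must be a numeric, date, or timestamp column
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")         // eight concurrent connections
  .load()
```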
Yes. Use DataFrameWriter.mode("append") or mode("overwrite"). For faster bulk loads, enable bulk copy in the Microsoft SQL Server Spark connector (options such as bulkCopyBatchSize and tableLock), or write to a staging table and run MERGE.
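A minimal write sketch over plain JDBC (no bulk copy connector); the staging-table name, credentials, and toy data are assumptions, and the spark session comes from the earlier sketch:

```scala
import spark.implicits._

// Toy result set; in a real job this comes from your transformations.
val totals = Seq((1, 250.0), (2, 99.5)).toDF("CustomerID", "TotalAmount")

totals.write
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "[dbo].[CustomerTotals_Staging]")
  .option("user", "spark_writer")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .mode("append")   // or "overwrite" to replace the table contents
  .save()
// A MERGE from the staging table into the live table can then run on SQL Server.
```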
Read Customers, join to Orders, aggregate totals in Spark, then write the results to a new table, CustomerTotals. The example below shows every step.
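A sketch of that pipeline, assuming column names CustomerID, CustomerName, OrderAmount, and OrderID, plus placeholder connection details:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("CustomerTotals").getOrCreate()

// Shared connection options; server, database, and credentials are placeholders.
val jdbcOpts = Map(
  "url"      -> "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB",
  "user"     -> "spark_etl",
  "password" -> sys.env("SQLSERVER_PASSWORD"),
  "driver"   -> "com.microsoft.sqlserver.jdbc.SQLServerDriver"
)

// 1) Read both source tables; Orders uses a partitioned read on its integer key.
val customers = spark.read.format("jdbc").options(jdbcOpts)
  .option("dbtable", "[dbo].[Customers]")
  .load()

val orders = spark.read.format("jdbc").options(jdbcOpts)
  .option("dbtable", "[dbo].[Orders]")
  .option("partitionColumn", "OrderID")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()

// 2) Join and aggregate in Spark.
val totals = customers.join(orders, "CustomerID")
  .groupBy("CustomerID", "CustomerName")
  .agg(sum("OrderAmount").alias("TotalAmount"))

// 3) Write the result to a new table.
totals.write.format("jdbc").options(jdbcOpts)
  .option("dbtable", "[dbo].[CustomerTotals]")
  .mode("overwrite")
  .save()
```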
1) Place the JDBC driver on spark.executor.extraClassPath. 2) Use Kerberos or Azure AD tokens instead of passwords. 3) Cache small lookup tables. 4) Use correct data types: map decimal to DecimalType with explicit precision and scale.
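For point 4, one way to pin the mapping is the JDBC reader's customSchema option; the table and column definitions below are illustrative:

```scala
// customSchema overrides the Spark-side types for the listed columns,
// here forcing Amount to DecimalType(18,2).
val invoices = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "[dbo].[Invoices]")
  .option("user", "spark_reader")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .option("customSchema", "InvoiceID INT, Amount DECIMAL(18, 2)")
  .load()

invoices.printSchema()
```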
Leaving out partition options forces a single-threaded read, causing long runtimes. Choose an integer key column (a date or timestamp column also works) and set the partition options.
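If good bounds are not known up front, one way to derive them is a quick min/max query against the table; the OrderID key column and connection details are assumptions:

```scala
// Read a one-row derived table containing the min and max key values.
val bounds = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "(SELECT MIN(OrderID) AS lo, MAX(OrderID) AS hi FROM dbo.Orders) AS b")
  .option("user", "spark_reader")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .load()
  .first()

val lower = bounds.getAs[Int]("lo")
val upper = bounds.getAs[Int]("hi")
// Feed lower and upper into lowerBound and upperBound for the partitioned read.
```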
SQL Server requires [schema].[table] or escaped spaces. Always include the schema in dbtable and wrap names in square brackets.
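For instance, a table whose name contains a space (the name is illustrative):

```scala
val details = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "[dbo].[Order Details]") // schema plus bracket-quoted table name
  .option("user", "spark_reader")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .load()
```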
Yes. Spark pushes simple predicates (equality, range) so SQL Server filters rows before transmitting them.
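You can verify this with explain(), which lists pushed predicates under PushedFilters in the physical plan; the Country column here is an assumption:

```scala
val us = spark.read
  .format("jdbc")
  .option("url", "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB")
  .option("dbtable", "[dbo].[Customers]")
  .option("user", "spark_reader")
  .option("password", sys.env("SQLSERVER_PASSWORD"))
  .load()
  .filter("Country = 'US'") // simple equality predicate, evaluated by SQL Server

us.explain() // the physical plan shows the predicate under PushedFilters
```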
Use the JDBC connection string option ;integratedSecurity=true and distribute mssql-jdbc_auth-*.dll to every executor.
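A sketch of such a connection; there are no user or password options, but the authentication DLL must be loadable on every executor:

```scala
val df = spark.read
  .format("jdbc")
  .option("url",
    "jdbc:sqlserver://sqlhost:1433;databaseName=SalesDB;integratedSecurity=true")
  .option("dbtable", "[dbo].[Customers]")
  .load()
```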
Bundle the exact Microsoft JDBC driver version that matches your SQL Server with --jars during spark-submit.
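An illustrative spark-submit invocation; the jar paths, driver version, and class name are placeholders, so match the mssql-jdbc version to your SQL Server and JRE:

```bash
spark-submit \
  --class com.example.CustomerTotals \
  --jars /opt/drivers/mssql-jdbc-12.6.1.jre11.jar \
  --conf spark.executor.extraClassPath=/opt/drivers/mssql-jdbc-12.6.1.jre11.jar \
  customer-totals.jar
```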