How to Integrate SQL Server with Apache Spark

Galaxy Glossary

How do you integrate SQL Server with Apache Spark using JDBC?

Use Spark’s JDBC data source to read from and write to SQL Server tables directly from Spark jobs.


Description

How do I connect SQL Server to Apache Spark?

Load the Microsoft SQL Server JDBC driver on every Spark executor, then call DataFrameReader.format("jdbc") with the correct connection options. Spark pushes filters and limits to SQL Server, giving you efficient, parallel reads.
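
A minimal setup sketch, assuming the driver JAR sits at /opt/jars/mssql-jdbc.jar (the path and app name are illustrative):

# Ship the driver JAR to the Spark driver and every executor, e.g.:
#   spark-submit --jars /opt/jars/mssql-jdbc.jar etl_job.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("sqlserver-integration")
    # In-code alternative to --jars; Spark distributes the JAR to executors
    .config("spark.jars", "/opt/jars/mssql-jdbc.jar")
    .getOrCreate())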

Which Spark options are required?

Mandatory keys are url, dbtable, user, and password. Optional keys like driver, fetchsize, and partitionColumn tune performance.
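
For instance, a read that sets the four mandatory keys plus two optional tuning keys (host, database, and credentials are illustrative):

# Mandatory: url, dbtable, user, password
df_orders = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://db:1433;databaseName=ecommerce")
    .option("dbtable", "[dbo].[Orders]")
    .option("user", "spark_app")
    .option("password", "pwd")
    # Optional: explicit driver class and rows fetched per network round trip
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")
    .option("fetchsize", "1000")
    .load())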

How can I partition reads for speed?

Set partitionColumn, lowerBound, upperBound, and numPartitions. Spark generates a predicate per partition, letting SQL Server scan in parallel and saturate network bandwidth.
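
As an illustration, the bounds below split roughly 100,000 ids across four connections; the comments show the approximate predicates Spark generates (exact boundaries depend on the stride arithmetic):

df = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://db:1433;databaseName=ecommerce")
    .option("dbtable", "[dbo].[Customers]")
    .option("user", "spark_app")
    .option("password", "pwd")
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "100000")
    .option("numPartitions", "4")
    .load())
# Spark issues one query per partition, approximately:
#   WHERE id < 25000 OR id IS NULL   (first partition also collects NULLs)
#   WHERE id >= 25000 AND id < 50000
#   WHERE id >= 50000 AND id < 75000
#   WHERE id >= 75000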

Can I write DataFrames back to SQL Server?

Yes. Use DataFrameWriter.mode("append") or "overwrite". For bulk speed, use the SQL Server Spark connector's bulk-copy options (such as bulkCopyBatchSize and tableLock), or save to a staging table and run MERGE, as sketched below.
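
A sketch of the staging-table pattern: append the batch to a hypothetical staging table, then run the MERGE through the same JDBC driver via PySpark's private _jvm gateway (a common but unofficial trick; table names and the merge key are illustrative).

url = "jdbc:sqlserver://db:1433;databaseName=ecommerce"

# 1. Append the batch into a staging table (created if absent)
(df_total.write.format("jdbc")
    .option("url", url)
    .option("dbtable", "[dbo].[CustomerTotals_stage]")
    .option("user", "spark_app")
    .option("password", "pwd")
    .mode("append")
    .save())

# 2. Run the MERGE on the Spark driver through java.sql.DriverManager
conn = spark.sparkContext._jvm.java.sql.DriverManager.getConnection(
    url, "spark_app", "pwd")
stmt = conn.createStatement()
stmt.executeUpdate("""
    MERGE [dbo].[CustomerTotals] AS t
    USING [dbo].[CustomerTotals_stage] AS s
        ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET t.lifetime_value = s.lifetime_value
    WHEN NOT MATCHED THEN
        INSERT (customer_id, lifetime_value)
        VALUES (s.customer_id, s.lifetime_value);""")
conn.close()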

What does a full read → transform → write flow look like?

Read Customers, join to Orders, aggregate totals in Spark, then write results to a new table CustomerTotals. The example below shows every step.

Best practices for production?

1) Place the JDBC driver on spark.executor.extraClassPath.
2) Use Kerberos or Azure AD tokens instead of passwords.
3) Cache small lookup tables.
4) Use correct data types: map SQL Server decimal columns to DecimalType with explicit precision and scale (see the sketch after this list).
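
A hedged sketch of points 3 and 4, assuming a hypothetical dbo.Regions lookup table and the aggregated df_total from the example below; createTableColumnTypes is the Spark JDBC write option for overriding the column types Spark creates:

from pyspark.sql.types import DecimalType

url = "jdbc:sqlserver://db:1433;databaseName=ecommerce"

# 3) Cache a small lookup table so repeated joins don't re-query SQL Server
df_regions = (spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", "[dbo].[Regions]")
    .option("user", "spark_app")
    .option("password", "pwd")
    .load()
    .cache())

# 4) Pin precision/scale both in Spark and in the table Spark creates
df_out = df_total.withColumn(
    "lifetime_value", df_total["lifetime_value"].cast(DecimalType(18, 2)))
(df_out.write.format("jdbc")
    .option("url", url)
    .option("dbtable", "[dbo].[CustomerTotals]")
    .option("user", "spark_app")
    .option("password", "pwd")
    .option("createTableColumnTypes", "lifetime_value DECIMAL(18,2)")
    .mode("overwrite")
    .save())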

Common mistakes

Using a single partition

Leaving out the partition options forces a single-threaded read over one JDBC connection, causing long runtimes on large tables. Pick an evenly distributed integer column (typically the primary key) as partitionColumn and set the bounds, as in the sketch below.
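
One way to avoid guessing the bounds is to ask SQL Server for them first. A sketch assuming an integer id key; the subquery alias in dbtable is required:

url = "jdbc:sqlserver://db:1433;databaseName=ecommerce"

# Fetch MIN/MAX once, then use them as partition bounds
bounds = (spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable",
            "(SELECT MIN(id) AS lo, MAX(id) AS hi FROM dbo.Customers) AS b")
    .option("user", "spark_app")
    .option("password", "pwd")
    .load()
    .first())

df = (spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", "[dbo].[Customers]")
    .option("user", "spark_app")
    .option("password", "pwd")
    .option("partitionColumn", "id")
    .option("lowerBound", str(bounds["lo"]))
    .option("upperBound", str(bounds["hi"]))
    .option("numPartitions", "8")
    .load())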

Forgetting to quote schema names

SQL Server object names that contain spaces or reserved words must be bracket-quoted, and omitting the schema silently falls back to the user's default schema. Always include the schema in dbtable and wrap each part in square brackets, e.g. [dbo].[Customers].


How to Integrate SQL Server with Apache Spark Example Usage


# 1. Read Customers and Orders
df_customers = spark.read.format("jdbc").options(
  url="jdbc:sqlserver://db:1433;databaseName=ecommerce",
  dbtable="[dbo].[Customers]",
  user="spark_app", password="pwd",
  partitionColumn="id", lowerBound="1", upperBound="100000", numPartitions="8"
).load()

df_orders = spark.read.format("jdbc").options(
  url="jdbc:sqlserver://db:1433;databaseName=ecommerce",
  dbtable="[dbo].[Orders]",
  user="spark_app", password="pwd"
).load()

# 2. Transform: join Customers to Orders, then total spend per customer
df_total = (df_customers
    .join(df_orders, df_customers.id == df_orders.customer_id)
    .groupBy("customer_id")
    .sum("total_amount")
    .withColumnRenamed("sum(total_amount)", "lifetime_value"))

# 3. Write totals back to SQL Server
df_total.write.format("jdbc").options(
  url="jdbc:sqlserver://db:1433;databaseName=ecommerce",
  dbtable="[dbo].[CustomerTotals]",
  user="spark_app", password="pwd"
).mode("overwrite").save()

How to Integrate SQL Server with Apache Spark Syntax


df = (spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://HOST:1433;databaseName=ecommerce")
    .option("dbtable", "[dbo].[Customers]")
    .option("user", "spark_app")
    .option("password", "****")
    # optional performance tuning
    .option("partitionColumn", "id")
    .option("lowerBound", "1")
    .option("upperBound", "100000")
    .option("numPartitions", "8")
    .load())

# Write back
aggregated_df.write.format("jdbc") \
  .option("url", "jdbc:sqlserver://HOST:1433;databaseName=ecommerce") \
  .option("dbtable", "[dbo].[CustomerTotals]") \
  .option("user", "spark_app") \
  .option("password", "****") \
  .mode("overwrite") \
  .save()


Frequently Asked Questions (FAQs)

Can Spark push filters to SQL Server?

Yes. Spark pushes simple predicates (equality, range) so SQL Server filters rows before transmitting them.
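
A quick way to confirm pushdown, using the df_customers DataFrame from the example above: apply a filter and look for a PushedFilters entry in the physical plan.

df_recent = df_customers.filter(df_customers.id > 50000)
# The plan's JDBC scan should list the predicate under PushedFilters,
# meaning SQL Server filters rows before they leave the database
df_recent.explain()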

Is Windows authentication supported?

Yes. Add ;integratedSecurity=true to the JDBC URL and distribute mssql-jdbc_auth-*.dll to every executor's native library path.
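
A hedged sketch of the URL shape (host and database are illustrative; the auth DLL must already be present on each executor):

# No user/password options: the connection uses the Windows identity
df = (spark.read.format("jdbc")
    .option("url",
            "jdbc:sqlserver://HOST:1433;databaseName=ecommerce;"
            "integratedSecurity=true")
    .option("dbtable", "[dbo].[Customers]")
    .load())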

How do I avoid driver version mismatches?

Pin the exact Microsoft JDBC driver version that matches your SQL Server release and ship it with --jars during spark-submit, so the driver and every executor load the same JAR.
