How to Integrate Oracle with Apache Spark

Galaxy Glossary

How do I integrate Oracle with Apache Spark?

Use Spark’s JDBC data source to read from and write to Oracle, allowing large-scale analytics on Oracle data with Spark’s distributed engine.



Why integrate Oracle with Apache Spark?

Running analytics directly on an Oracle OLTP instance stresses the database. Pulling data into Spark off-loads heavy aggregation, joins, and ML workloads while Oracle remains the system of record.

Which drivers and libraries are required?

Add the Oracle JDBC driver (ojdbc8.jar or ojdbc11.jar) to Spark’s classpath. With Spark 3+, pass --jars to spark-submit or set spark.jars when building the SparkSession.
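
A minimal sketch of attaching the driver when building the session; the jar path /opt/jars/ojdbc8.jar is an assumed location, so substitute your own:

# Attach the Oracle JDBC driver to the Spark classpath at session creation.
# Command-line equivalent: spark-submit --jars /opt/jars/ojdbc8.jar app.py
from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("oracle-spark-integration")
    .config("spark.jars", "/opt/jars/ojdbc8.jar")  # assumed driver path
    .getOrCreate())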

How to build the JDBC connection string?

Format: jdbc:oracle:thin:@//HOST:PORT/SERVICE. Example: jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB. Include user, password, fetchsize, and partitioning options as DataFrame reader/writer options.
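
For example, the shared connection settings can be gathered into one dictionary and reused across the reads and writes sketched below (the host, service name, and credentials are placeholders):

# Hypothetical shared connection options, reused by later sketches.
oracle_opts = {
    "url": "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB",
    "user": "analytics",   # placeholder credentials
    "password": "******",
    "fetchsize": "2000",
}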

How to load Oracle tables into Spark DataFrames?

Use spark.read.format("jdbc") with url, dbtable, user, and password. Optionally add partitionColumn, lowerBound, upperBound, and numPartitions for parallel reads.
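
A minimal sketch of a parallel read, reusing the oracle_opts dictionary above; the id bounds are assumptions and should be derived from the table’s actual MIN(id) and MAX(id):

# Parallel read: Spark opens numPartitions connections, each scanning an id range.
df_orders = (spark.read.format("jdbc")
    .options(**oracle_opts)
    .option("dbtable", "Orders")
    .option("partitionColumn", "id")   # numeric column with even distribution
    .option("lowerBound", "1")
    .option("upperBound", "500000")    # assumed; derive from MIN(id)/MAX(id)
    .option("numPartitions", "10")
    .load())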

How to write Spark results back to Oracle?

Call DataFrame.write.format("jdbc").mode("append") and supply the same connection options plus the target Oracle table name.
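
A matching write sketch, again assuming the oracle_opts dictionary from earlier; CUSTOMER_TOTALS is a hypothetical target table:

# Append rows to an existing Oracle table; Spark issues batched INSERTs.
(df_totals.write.format("jdbc")
    .options(**oracle_opts)
    .option("dbtable", "CUSTOMER_TOTALS")  # assumed target table
    .mode("append")
    .save())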

How to read multiple ecommerce tables efficiently?

Leverage predicate push-down by passing a query with the query option (Spark 2.4+) instead of dbtable. Example: SELECT * FROM Orders WHERE order_date > SYSDATE - 30.
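
A sketch using the query option with the oracle_opts dictionary from earlier (note that query cannot be combined with partitionColumn; wrap a subquery in dbtable if you need both push-down and parallelism):

# Push the 30-day filter into Oracle so only recent rows cross the network.
df_recent = (spark.read.format("jdbc")
    .options(**oracle_opts)
    .option("query", "SELECT * FROM Orders WHERE order_date > SYSDATE - 30")
    .load())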

What performance tuning knobs matter?

Adjust fetchsize (the Oracle JDBC default is 10); values of 1000–5000 usually work well. Partition on numeric or date columns with an even distribution, and increase numPartitions to match Oracle CPU cores.

Common mistakes and fixes

Wrong driver version: Using ojdbc6 with JDK 11 causes class errors—use ojdbc8 or ojdbc11.
Single-partition reads: Omitting partitionColumn loads data through one connection—set partitioning options for parallelism.

Best practices for production?

Cache small dimension tables in memory, schedule incremental loads with watermarks, and monitor Oracle session counts. Always close Spark sessions to release JDBC connections.
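
For instance, a small dimension table can be cached and broadcast so repeated joins avoid re-reading Oracle (df_items and df_products refer to DataFrames like those loaded in the example later in this article):

from pyspark.sql.functions import broadcast

# Keep the small dimension table in executor memory after the first read.
df_products.cache()

# Broadcast it so the join avoids shuffling the large fact table.
df_enriched = df_items.join(broadcast(df_products),
                            df_items.product_id == df_products.id)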

FAQ

Does Spark push filters to Oracle?

Yes. Spark translates filter predicates into the generated Oracle query, reducing data transfer.

Can I use Kerberos or SSL?

Yes. Append oracle.net.wallet_location or Kerberos parameters in the JDBC URL and include the wallet or krb5.conf on all Spark nodes.
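
A heavily hedged sketch for wallet-based SSL: Spark forwards unrecognized reader options to the driver as JDBC connection properties, so the standard Oracle wallet property can be supplied directly (the wallet directory /opt/wallet is an assumption, and the exact value depends on your Oracle network configuration):

# Assumed wallet location; the option is passed through as a JDBC connection property.
df_secure = (spark.read.format("jdbc")
    .options(**oracle_opts)
    .option("oracle.net.wallet_location",
            "(SOURCE=(METHOD=FILE)(METHOD_DATA=(DIRECTORY=/opt/wallet)))")
    .option("dbtable", "Customers")
    .load())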

How do I avoid Oracle table locks?

Use read-only accounts and set the session to READ ONLY when possible. Writing with mode("append") issues INSERTs, avoiding full table locks.
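
One way to sketch this is with Spark’s sessionInitStatement option, which executes a statement on each new JDBC session before reading; whether SET TRANSACTION READ ONLY fits depends on your workload:

# Put every JDBC session into read-only mode before Spark issues its query.
df_ro = (spark.read.format("jdbc")
    .options(**oracle_opts)
    .option("sessionInitStatement", "SET TRANSACTION READ ONLY")
    .option("dbtable", "Orders")
    .load())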



How to Integrate Oracle with Apache Spark Example Usage


# PySpark example – load recent Orders plus OrderItems and Products, join, and write customer totals
df_orders = (spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB")
  .option("dbtable", "Orders")
  .option("user", "analytics")
  .option("password", "******")
  .option("fetchsize", "2000")
  .option("partitionColumn", "id")
  .option("lowerBound", "1")
  .option("upperBound", "500000")
  .option("numPartitions", "10")
  .load()
  .filter("order_date >= current_date() - interval 30 days"))

df_items = (spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB")
  .option("dbtable", "OrderItems")
  .option("user", "analytics")
  .option("password", "******")
  .load())

df_products = (spark.read.format("jdbc")
  .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB")
  .option("dbtable", "Products")
  .option("user", "analytics")
  .option("password", "******")
  .load())

df_total = (df_orders.join(df_items, df_orders.id == df_items.order_id)
    .join(df_products, df_items.product_id == df_products.id)
    .groupBy("customer_id")
    .sum("total_amount"))

(df_total.write.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB")
    .option("dbtable", "CUSTOMER_30DAY_TOTALS")
    .option("user", "analytics")
    .option("password", "******")
    .mode("overwrite")
    .save())

How to Integrate Oracle with Apache Spark Syntax


# PySpark syntax for Oracle JDBC integration
spark.read \
  .format("jdbc") \
  .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB") \
  .option("dbtable", "Customers") \
  .option("user", "analytics") \
  .option("password", "******") \
  .option("fetchsize", "2000") \
  .option("partitionColumn", "id") \
  .option("lowerBound", "1") \
  .option("upperBound", "100000") \
  .option("numPartitions", "8") \
  .load()

# Writing aggregated sales back to Oracle
sales_df.write \
  .format("jdbc") \
  .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB") \
  .option("dbtable", "DAILY_SALES_SUMMARY") \
  .option("user", "analytics") \
  .option("password", "******") \
  .mode("append") \
  .save()


Frequently Asked Questions (FAQs)

Does Spark support Oracle RAC?

Yes. Use SCAN listeners in the JDBC URL for automatic load balancing across RAC nodes.

Can I push Spark DataFrame schema changes to Oracle automatically?

Not natively. Create or alter Oracle tables beforehand, or use .option("createTableColumnTypes", ...) when saving.
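
A brief sketch, assuming the oracle_opts dictionary from earlier; the column types shown are assumptions about the target schema and apply only when Spark creates the table:

# Override the default type mapping when Spark creates the target table.
(sales_df.write.format("jdbc")
    .options(**oracle_opts)
    .option("dbtable", "DAILY_SALES_SUMMARY")
    .option("createTableColumnTypes", "sale_date DATE, total_amount DECIMAL(12,2)")
    .mode("overwrite")
    .save())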

What authentication methods are supported?

Password, Kerberos, and Oracle Wallet SSL are all compatible with Spark’s JDBC connector.
