Use Spark’s JDBC data source to read from and write to Oracle, allowing large-scale analytics on Oracle data with Spark’s distributed engine.
Running analytics directly on an Oracle OLTP instance stresses the database. Pulling data into Spark off-loads heavy aggregation, joins, and ML workloads while Oracle remains the system of record.
Add the Oracle JDBC driver (ojdbc8.jar or ojdbc11.jar, matched to your JDK) to Spark's classpath. With Spark 3+, pass --jars to spark-submit or set spark.jars in the SparkSession builder.
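A minimal sketch of both approaches; the driver path and application name are placeholders:

    # Option 1: supply the driver at submit time
    #   spark-submit --jars /path/to/ojdbc8.jar my_job.py

    # Option 2: set spark.jars when building the session
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("oracle-analytics")                  # hypothetical app name
        .config("spark.jars", "/path/to/ojdbc8.jar")  # placeholder path
        .getOrCreate()
    )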
Format: jdbc:oracle:thin:@//HOST:PORT/SERVICE. Example: jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB. Include user, password, fetchsize, and partitioning options as DataFrame reader/writer options.
Use spark.read.format("jdbc") with url, dbtable, user, and password. Optionally add partitionColumn, lowerBound, upperBound, and numPartitions for parallel reads.
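A sketch of a parallel read using the URL above; the table, account, and ORDER_ID bounds are illustrative:

    df = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB")
        .option("dbtable", "ORDERS")                  # hypothetical table
        .option("user", "analytics_ro")               # hypothetical read-only account
        .option("password", "secret")
        .option("driver", "oracle.jdbc.OracleDriver")
        # Spark opens numPartitions connections, each scanning a slice of
        # ORDER_ID between lowerBound and upperBound.
        .option("partitionColumn", "ORDER_ID")
        .option("lowerBound", "1")
        .option("upperBound", "10000000")
        .option("numPartitions", "8")
        .load()
    )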
Call DataFrame.write.format("jdbc").mode("append") and supply the same connection options plus the target Oracle table name.
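A corresponding write sketch; ORDER_SUMMARY is a hypothetical, pre-created target table:

    (
        df.write.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB")
        .option("dbtable", "ORDER_SUMMARY")
        .option("user", "analytics_rw")               # hypothetical writable account
        .option("password", "secret")
        .option("driver", "oracle.jdbc.OracleDriver")
        .option("batchsize", "5000")                  # rows per batched INSERT
        .mode("append")                               # issues INSERTs only
        .save()
    )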
Leverage predicate push-down by using the query option instead of dbtable. Example: SELECT * FROM orders WHERE order_date > SYSDATE - 30.
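A sketch with the query option, so Oracle evaluates the date filter before any rows reach Spark:

    recent = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB")
        .option("user", "analytics_ro")
        .option("password", "secret")
        # Oracle runs this query; only the last 30 days of rows are transferred.
        .option("query", "SELECT * FROM orders WHERE order_date > SYSDATE - 30")
        .load()
    )

Note that query cannot be combined with partitionColumn; for partitioned reads, use dbtable and filter the DataFrame so Spark pushes the predicate into the generated query.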
Adjust fetchsize (the Oracle driver defaults to 10 rows per round trip); 1000–5000 works well. Partition on numeric or date columns with an even distribution, and increase numPartitions up to roughly the Oracle server's CPU core count.
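One way to keep partitions even is to ask Oracle for the column's actual min and max first; a sketch, reusing the connection details above:

    # Single-row query for the bounds; Oracle returns unquoted aliases in uppercase.
    bounds = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB")
        .option("user", "analytics_ro")
        .option("password", "secret")
        .option("query", "SELECT MIN(order_id) AS lo, MAX(order_id) AS hi FROM orders")
        .load()
        .first()
    )

    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB")
        .option("dbtable", "ORDERS")
        .option("user", "analytics_ro")
        .option("password", "secret")
        .option("fetchsize", "2000")                  # up from the driver default of 10
        .option("partitionColumn", "ORDER_ID")
        .option("lowerBound", str(bounds["LO"]))
        .option("upperBound", str(bounds["HI"]))
        .option("numPartitions", "16")                # roughly match Oracle CPU cores
        .load()
    )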
Wrong driver version: Using ojdbc6 with JDK 11 causes class errors—use ojdbc8 or ojdbc11.
Single-partition reads: Omitting partitionColumn loads data through one connection—set partitioning options for parallelism.
Cache small dimension tables in memory, schedule incremental loads with watermarks, and monitor Oracle session counts. Always close Spark sessions to release JDBC connections.
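A minimal incremental-load sketch, assuming a LAST_UPDATED column and a watermark value persisted between runs (both hypothetical):

    # In practice the watermark would be read from a checkpoint table or file.
    watermark = "2024-01-01 00:00:00"

    incremental = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB")
        .option("user", "analytics_ro")
        .option("password", "secret")
        .option(
            "query",
            "SELECT * FROM orders WHERE last_updated > "
            f"TO_TIMESTAMP('{watermark}', 'YYYY-MM-DD HH24:MI:SS')",
        )
        .load()
    )

    # Stop the session when done so executors release their JDBC connections.
    spark.stop()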
Yes. Spark translates DataFrame filter predicates into the WHERE clause of the generated Oracle query, reducing data transfer.
Yes. Pass oracle.net.wallet_location or Kerberos parameters as connection properties (or append them to the JDBC URL) and make the wallet or krb5.conf available on all Spark nodes.
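A hedged sketch passing the wallet location as a connection property; Spark forwards non-Spark options to the JDBC driver, and the TCPS port and wallet path are placeholders:

    secure = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@tcps://db.acme.com:2484/ORCLCDB")
        .option("dbtable", "ORDERS")
        # Forwarded to the Oracle driver as a connection property; the wallet
        # directory must exist at this path on every executor.
        .option("oracle.net.wallet_location", "/etc/oracle/wallet")
        .load()
    )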
Use read-only accounts and, where possible, open read-only transactions (Oracle's SET TRANSACTION READ ONLY). Writing with mode("append") issues INSERTs, avoiding full table locks.
Yes. Point the JDBC URL at the SCAN listener (e.g., jdbc:oracle:thin:@//rac-scan.acme.com:1521/ORCLCDB) for automatic load balancing across RAC nodes.
Not natively. Create or alter Oracle tables beforehand, or use .option("createTableColumnTypes", ...) when saving.
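A sketch of overriding the generated column types when Spark creates the table; the column list is illustrative:

    (
        df.write.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//db.acme.com:1521/ORCLCDB")
        .option("dbtable", "ORDER_SUMMARY")
        .option("user", "analytics_rw")
        .option("password", "secret")
        # Used only when Spark creates the table; overrides the default type mapping.
        .option("createTableColumnTypes", "customer_name VARCHAR(200), order_total DECIMAL(12,2)")
        .mode("overwrite")                            # lets Spark (re)create the table
        .save()
    )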
Password, Kerberos, and Oracle Wallet SSL are all compatible with Spark’s JDBC connector.