How to Integrate ClickHouse with Apache Spark

Galaxy Glossary

How do I integrate ClickHouse with Apache Spark?

Connect Spark to ClickHouse to read and write high-volume data with JDBC or the native connector.


Description

Why integrate ClickHouse with Apache Spark?

ClickHouse delivers sub-second OLAP queries, while Spark excels at distributed transformation. Combining them lets you clean, enrich, and aggregate terabytes in Spark, then store results in ClickHouse for lightning-fast dashboards.

How do I add the ClickHouse connector to Spark?

Include both the ClickHouse JDBC driver and the spark-clickhouse connector JAR when launching Spark.

spark-shell \
--packages com.clickhouse:clickhouse-jdbc:0.4.6,\
com.github.housepower:spark-clickhouse-connector_2.12:0.5.0

Gradle / Maven dependency

implementation("ru.yandex.clickhouse:clickhouse-jdbc:0.4.6")
implementation("com.github.housepower:spark-clickhouse-connector_2.12:0.5.0")

What is the basic Spark-SQL read syntax?

Use the DataFrame reader with format "clickhouse" (preferred) or "jdbc" and supply connection options.

val customers = spark.read
.format("clickhouse")
.option("host", "ckh.prod.local")
.option("port", "9000")
.option("database", "shop")
.option("user", "analytics")
.option("password", sys.env("CKH_PW"))
.option("table", "Customers")
.load()
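
If the native connector is not available, Spark's built-in JDBC source works as well. A minimal read sketch, assuming the com.clickhouse JDBC driver is on the classpath and the server's HTTP interface is exposed on port 8123:

// Fallback: read through Spark's generic JDBC data source
val customersJdbc = spark.read
  .format("jdbc")
  .option("url", "jdbc:clickhouse://ckh.prod.local:8123/shop")
  .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
  .option("dbtable", "Customers")
  .option("user", "analytics")
  .option("password", sys.env("CKH_PW"))
  .load()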

How do I write Spark DataFrames back to ClickHouse?

orders
.write
.format("clickhouse")
.option("host", "ckh.prod.local")
.option("database", "shop")
.option("table", "Orders")
.option("create_table", "auto")
.mode("append")
.save()

Which options improve performance?

  • batchsize – raise to 10,000 for bulk inserts
  • numPartitions – match cluster cores for parallel reads
  • partitionColumn, lowerBound, upperBound – enable predicate push-down
  • compress – use lz4 or zstd to cut network I/O
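
With Spark's generic JDBC source these map directly onto reader and writer options. A minimal sketch, assuming a numeric order_id key and a hypothetical orders_staging target table; option names can differ slightly in the native connector:

// Partitioned read: Spark issues one range query per partition of order_id
val ordersJdbc = spark.read
  .format("jdbc")
  .option("url", "jdbc:clickhouse://ckh.prod.local:8123/shop")
  .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
  .option("dbtable", "Orders")
  .option("user", "analytics")
  .option("password", sys.env("CKH_PW"))
  .option("partitionColumn", "order_id")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()

// Bulk write: a larger batch size means fewer round trips per partition
ordersJdbc.write
  .format("jdbc")
  .option("url", "jdbc:clickhouse://ckh.prod.local:8123/shop")
  .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
  .option("dbtable", "orders_staging")
  .option("user", "analytics")
  .option("password", sys.env("CKH_PW"))
  .option("batchsize", "10000")
  .mode("append")
  .save()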

Best practices for ecommerce analytics?

Join Orders, OrderItems, and Products in Spark, calculate daily GMV, then write a summarized daily_sales table to ClickHouse. Keep identical column order and types to avoid cast overhead.

Common mistakes and quick fixes

Missing connector JARs cause a ClassNotFoundException; always include both the JDBC driver and the Spark connector. Decimal mismatches throw "Illegal column type"; cast Decimal(10,2) columns to Float64 before writing.
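
A minimal sketch of the cast fix, assuming an orders DataFrame with a Decimal(10,2) price column; ClickHouse Float64 corresponds to Spark's DoubleType:

import org.apache.spark.sql.functions.col

// Cast the decimal column to Double so it lands in ClickHouse as Float64
val ordersForClickHouse = orders.withColumn("price", col("price").cast("double"))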

Next steps

Schedule the job in Airflow or use Structured Streaming to push near-real-time updates from Spark to ClickHouse.


How to Integrate ClickHouse with Apache Spark Example Usage


// Aggregate ecommerce revenue and store in ClickHouse
val revenue = spark.table("Orders")
  .join(spark.table("OrderItems"), "order_id")
  .groupBy(col("order_date"))
  .agg(round(sum(col("quantity") * col("price")), 2).alias("total_revenue"))

revenue.write
  .format("clickhouse")
  .option("host", "ckh.prod.local")
  .option("database", "shop")
  .option("table", "daily_sales")
  .option("create_table", "auto")
  .mode("overwrite")
  .save()

How to Integrate ClickHouse with Apache Spark Syntax


// READ from ClickHouse
val df = spark.read
             .format("clickhouse")
             .option("host", "<host>")
             .option("port", "9000")
             .option("database", "shop")
             .option("user", "analytics")
             .option("password", "<pwd>")
             .option("table", "Customers")
             .option("numPartitions", "8")
             .load()

// WRITE to ClickHouse
df.write
  .format("clickhouse")
  .option("host", "<host>")
  .option("database", "shop")
  .option("table", "Orders")
  .option("batchsize", "10000")
  .option("compress", "lz4")
  .mode("append")
  .save()


Frequently Asked Questions (FAQs)

Can I use Spark Structured Streaming with ClickHouse?

Yes. Configure the write stream with format "clickhouse" and a checkpoint location. The connector will perform micro-batch inserts.
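
A minimal streaming sketch under that assumption, where events is a hypothetical streaming DataFrame (for example, parsed from Kafka) and the checkpoint path is illustrative:

// Micro-batch writes from a streaming DataFrame into ClickHouse
val query = events.writeStream
  .format("clickhouse")
  .option("host", "ckh.prod.local")
  .option("database", "shop")
  .option("table", "events")
  .option("checkpointLocation", "/tmp/checkpoints/ckh-events")
  .outputMode("append")
  .start()

query.awaitTermination()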

Does predicate push-down work?

Only when using the native connector. Specify partitionColumn, lowerBound, and upperBound to enable data pruning.

How do I handle schema evolution?

Set create_table to auto for first write, then alter ClickHouse tables manually or through versioned DDL scripts to keep schema changes explicit.
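
One way to keep those DDL changes explicit is to run them over plain JDBC from the same Spark job. A minimal sketch, assuming the com.clickhouse driver is on the classpath; the discount column is hypothetical:

import java.sql.DriverManager

// Apply a versioned schema change explicitly instead of relying on create_table=auto
val conn = DriverManager.getConnection(
  "jdbc:clickhouse://ckh.prod.local:8123/shop", "analytics", sys.env("CKH_PW"))
try {
  conn.createStatement().execute(
    "ALTER TABLE shop.Orders ADD COLUMN IF NOT EXISTS discount Float64 DEFAULT 0")
} finally {
  conn.close()
}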
