Connect Spark to ClickHouse to read and write high-volume data with JDBC or the native connector.
ClickHouse delivers sub-second OLAP queries, while Spark excels at distributed transformation. Combining them lets you clean, enrich, and aggregate terabytes in Spark, then store results in ClickHouse for lightning-fast dashboards.
Include both the ClickHouse JDBC driver and the spark-clickhouse connector JAR when launching Spark.
spark-shell \
--packages com.clickhouse:clickhouse-jdbc:0.4.6,\
com.github.housepower:spark-clickhouse-connector_2.12:0.5.0
implementation("ru.yandex.clickhouse:clickhouse-jdbc:0.4.6")
implementation("com.github.housepower:spark-clickhouse-connector_2.12:0.5.0")
Use the DataFrame reader with format "clickhouse" (preferred) or "jdbc" and supply connection options.
val customers = spark.read
.format("clickhouse")
.option("host", "ckh.prod.local")
.option("port", "9000")
.option("database", "shop")
.option("user", "analytics")
.option("password", sys.env("CKH_PW"))
.option("table", "Customers")
.load()
orders
.write
.format("clickhouse")
.option("host", "ckh.prod.local")
.option("database", "shop")
.option("table", "Orders")
.option("create_table", "auto")
.mode("append")
.save()
Key tuning options:
- batchsize – raise to 10 000 for bulk inserts
- numPartitions – match cluster cores for parallel reads
- partitionColumn, lowerBound, upperBound – enable predicate push-down
- compress – use lz4 or zstd to cut network I/O
Join Orders, OrderItems, and Products in Spark, calculate daily GMV, then write a summarized daily_sales table to ClickHouse, as sketched below. Keep identical column order and types to avoid cast overhead.
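For illustration, here is one way these options and the daily GMV pipeline might fit together; the column names (order_id, product_id, order_date, quantity, price) and the partition bounds are assumptions, not part of the shop schema above.

// Sketch: read the source tables with the tuning options above, compute daily
// GMV in Spark, and append the result to ClickHouse. Column names and bounds
// are illustrative assumptions.
import org.apache.spark.sql.functions._

def chRead(table: String) = spark.read
  .format("clickhouse")
  .option("host", "ckh.prod.local")
  .option("port", "9000")
  .option("database", "shop")
  .option("user", "analytics")
  .option("password", sys.env("CKH_PW"))
  .option("table", table)
  .option("numPartitions", "16")            // parallel read tasks
  .option("partitionColumn", "order_id")    // split the scan into ranges of order_id
  .option("lowerBound", "1")
  .option("upperBound", "100000000")
  .option("compress", "lz4")                // cut network I/O
  .load()

val orders     = chRead("Orders")
val orderItems = chRead("OrderItems")
val products   = chRead("Products")

val dailySales = orders
  .join(orderItems, "order_id")
  .join(products, "product_id")
  .groupBy(to_date(col("order_date")).as("sale_date"))
  .agg(sum(col("quantity") * col("price")).as("gmv"))

dailySales.write
  .format("clickhouse")
  .option("host", "ckh.prod.local")
  .option("database", "shop")
  .option("table", "daily_sales")
  .option("batchsize", "10000")             // bulk inserts
  .option("create_table", "auto")
  .mode("append")
  .save()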
Missing connector JARs cause a ClassNotFoundException; always include both the JDBC driver and the Spark connector. Decimal mismatches throw "Illegal column type": cast Decimal(10,2) to Float64 before writing.
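A minimal sketch of that cast, assuming a Decimal(10,2) column named amount; Spark's DoubleType maps to ClickHouse Float64.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Cast the Decimal(10,2) column to Double (Float64 in ClickHouse) before the write;
// "amount" is an illustrative column name.
val ordersForWrite = orders.withColumn("amount", col("amount").cast(DoubleType))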
Schedule the job in Airflow or use Structured Streaming to push near-real-time updates from Spark to ClickHouse.
For streaming, configure the write stream with format "clickhouse" and a checkpoint location; the connector performs micro-batch inserts.
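A minimal streaming sketch under that assumption; the Kafka source, topic, target table, and checkpoint path are illustrative.

// Sketch: micro-batch events from Kafka into ClickHouse with Structured Streaming.
// Broker, topic, table, and checkpoint path are illustrative assumptions.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka.prod.local:9092")
  .option("subscribe", "orders")
  .load()
  .selectExpr("CAST(value AS STRING) AS payload", "timestamp")

events.writeStream
  .format("clickhouse")
  .option("host", "ckh.prod.local")
  .option("database", "shop")
  .option("table", "order_events")
  .option("checkpointLocation", "/checkpoints/order_events")
  .outputMode("append")
  .start()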
Predicate push-down and data pruning work only with the native connector: specify partitionColumn, lowerBound, and upperBound (as in the sketch above) to enable them.
Set create_table to auto for the first write, then alter ClickHouse tables manually or through versioned DDL scripts to keep schema changes explicit.
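One way to keep those DDL scripts explicit is to apply them over JDBC before the Spark job writes; the script path, the 8123 HTTP port, and the ALTER statement in the comment are assumptions for illustration.

// Sketch: run a versioned DDL file against ClickHouse over JDBC before writing.
// e.g. ddl/V2__add_discount_column.sql containing
//   ALTER TABLE Orders ADD COLUMN discount Float64
import java.sql.DriverManager

val conn = DriverManager.getConnection(
  "jdbc:clickhouse://ckh.prod.local:8123/shop", "analytics", sys.env("CKH_PW"))
try {
  val ddl = scala.io.Source.fromFile("ddl/V2__add_discount_column.sql").mkString
  conn.createStatement().execute(ddl)
} finally conn.close()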