How to Integrate ClickHouse with Apache Spark

Galaxy Glossary

How do I integrate ClickHouse with Apache Spark?

Connect Spark to ClickHouse to read and write high-volume data with JDBC or the native connector.


Description

Why integrate ClickHouse with Apache Spark?

ClickHouse delivers sub-second OLAP queries, while Spark excels at distributed transformation. Combining them lets you clean, enrich, and aggregate terabytes in Spark, then store results in ClickHouse for lightning-fast dashboards.

How do I add the ClickHouse connector to Spark?

Include both the ClickHouse JDBC driver and the spark-clickhouse connector JAR when launching Spark.

spark-shell \
--packages com.clickhouse:clickhouse-jdbc:0.4.6,\
com.github.housepower:spark-clickhouse-connector_2.12:0.5.0

Gradle / Maven dependency

implementation("ru.yandex.clickhouse:clickhouse-jdbc:0.4.6")
implementation("com.github.housepower:spark-clickhouse-connector_2.12:0.5.0")

What is the basic Spark-SQL read syntax?

Use the DataFrame reader with format "clickhouse" (preferred) or "jdbc" and supply connection options.

val customers = spark.read
.format("clickhouse")
.option("host", "ckh.prod.local")
.option("port", "9000")
.option("database", "shop")
.option("user", "analytics")
.option("password", sys.env("CKH_PW"))
.option("table", "Customers")
.load()
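
If the native connector is not available, Spark's built-in JDBC source works as well. A minimal read sketch, assuming the com.clickhouse JDBC driver is on the classpath and the server's HTTP interface is exposed on port 8123:

// Fallback: read through Spark's generic JDBC data source
val customersJdbc = spark.read
  .format("jdbc")
  .option("url", "jdbc:clickhouse://ckh.prod.local:8123/shop")
  .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
  .option("dbtable", "Customers")
  .option("user", "analytics")
  .option("password", sys.env("CKH_PW"))
  .load()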

How do I write Spark DataFrames back to ClickHouse?

orders
.write
.format("clickhouse")
.option("host", "ckh.prod.local")
.option("database", "shop")
.option("table", "Orders")
.option("create_table", "auto")
.mode("append")
.save()

Which options improve performance?

  • batchsize – raise to 10,000 for bulk inserts
  • numPartitions – match cluster cores for parallel reads
  • partitionColumn, lowerBound, upperBound – enable predicate push-down
  • compress – use lz4 or zstd to cut network I/O
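
With Spark's generic JDBC source these map directly onto reader and writer options. A minimal sketch, assuming a numeric order_id key and a hypothetical orders_staging target table; option names can differ slightly in the native connector:

// Partitioned read: Spark issues one range query per partition of order_id
val ordersJdbc = spark.read
  .format("jdbc")
  .option("url", "jdbc:clickhouse://ckh.prod.local:8123/shop")
  .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
  .option("dbtable", "Orders")
  .option("user", "analytics")
  .option("password", sys.env("CKH_PW"))
  .option("partitionColumn", "order_id")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "8")
  .load()

// Bulk write: a larger batch size means fewer round trips per partition
ordersJdbc.write
  .format("jdbc")
  .option("url", "jdbc:clickhouse://ckh.prod.local:8123/shop")
  .option("driver", "com.clickhouse.jdbc.ClickHouseDriver")
  .option("dbtable", "orders_staging")
  .option("user", "analytics")
  .option("password", sys.env("CKH_PW"))
  .option("batchsize", "10000")
  .mode("append")
  .save()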

Best practices for ecommerce analytics?

Join Orders, OrderItems, and Products in Spark, calculate daily GMV, then write a summarized daily_sales table to ClickHouse. Keep identical column order and types to avoid cast overhead.

Common mistakes and quick fixes

Missing connector JARs cause a ClassNotFoundException; always include both the JDBC driver and the Spark connector. Decimal mismatches throw "Illegal column type"; cast Decimal(10,2) columns to Float64 before writing.
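
A minimal sketch of the cast fix, assuming an orders DataFrame with a Decimal(10,2) price column; ClickHouse Float64 corresponds to Spark's DoubleType:

import org.apache.spark.sql.functions.col

// Cast the decimal column to Double so it lands in ClickHouse as Float64
val ordersForClickHouse = orders.withColumn("price", col("price").cast("double"))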

Next steps

Schedule the job in Airflow or use Structured Streaming to push near-real-time updates from Spark to ClickHouse.


How to Integrate ClickHouse with Apache Spark Example Usage


// Aggregate ecommerce revenue and store in ClickHouse
val revenue = spark.table("Orders")
  .join(spark.table("OrderItems"), "order_id")
  .groupBy(col("order_date"))
  .agg(round(sum(col("quantity") * col("price")), 2).alias("total_revenue"))

revenue.write
  .format("clickhouse")
  .option("host", "ckh.prod.local")
  .option("database", "shop")
  .option("table", "daily_sales")
  .option("create_table", "auto")
  .mode("overwrite")
  .save()

How to Integrate ClickHouse with Apache Spark Syntax


// READ from ClickHouse
val df = spark.read
             .format("clickhouse")
             .option("host", "<host>")
             .option("port", "9000")
             .option("database", "shop")
             .option("user", "analytics")
             .option("password", "<pwd>")
             .option("table", "Customers")
             .option("numPartitions", "8")
             .load()

// WRITE to ClickHouse
df.write
  .format("clickhouse")
  .option("host", "<host>")
  .option("database", "shop")
  .option("table", "Orders")
  .option("batchsize", "10000")
  .option("compress", "lz4")
  .mode("append")
  .save()


Frequently Asked Questions (FAQs)

Can I use Spark Structured Streaming with ClickHouse?

Yes. Configure the write stream with format "clickhouse" and a checkpoint location. The connector will perform micro-batch inserts.
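
A minimal streaming sketch under that assumption, where events is a hypothetical streaming DataFrame (for example, parsed from Kafka) and the checkpoint path is illustrative:

// Micro-batch writes from a streaming DataFrame into ClickHouse
val query = events.writeStream
  .format("clickhouse")
  .option("host", "ckh.prod.local")
  .option("database", "shop")
  .option("table", "events")
  .option("checkpointLocation", "/tmp/checkpoints/ckh-events")
  .outputMode("append")
  .start()

query.awaitTermination()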

Does predicate push-down work?

Only when using the native connector. Specify partitionColumn, lowerBound, and upperBound to enable data pruning.

How do I handle schema evolution?

Set create_table to auto for first write, then alter ClickHouse tables manually or through versioned DDL scripts to keep schema changes explicit.
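
One way to keep those DDL changes explicit is to run them over plain JDBC from the same Spark job. A minimal sketch, assuming the com.clickhouse driver is on the classpath; the discount column is hypothetical:

import java.sql.DriverManager

// Apply a versioned schema change explicitly instead of relying on create_table=auto
val conn = DriverManager.getConnection(
  "jdbc:clickhouse://ckh.prod.local:8123/shop", "analytics", sys.env("CKH_PW"))
try {
  conn.createStatement().execute(
    "ALTER TABLE shop.Orders ADD COLUMN IF NOT EXISTS discount Float64 DEFAULT 0")
} finally {
  conn.close()
}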
