Configure Airflow to run ClickHouse SQL, transfer data, and orchestrate pipelines.
Airflow lets you schedule, parameterize, and monitor ClickHouse queries and ETL steps, turning ad-hoc SQL into repeatable data pipelines.
The open-source airflow-clickhouse-plugin package ships a ClickHouseHook and a ClickHouseOperator. Install it with pip install airflow-clickhouse-plugin.
In the Airflow UI, add a connection with Conn Id clickhouse_default, type ClickHouse, host, port, user, password, and database.
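The same connection can also be created from the CLI. A sketch is below; the host, port, credentials, and database are placeholders, and the clickhouse connection type assumes your installed plugin registers it.

    # Port 9000 is ClickHouse's native-protocol port; HTTP clients use 8123.
    airflow connections add clickhouse_default \
        --conn-type clickhouse \
        --conn-host clickhouse-server \
        --conn-port 9000 \
        --conn-login default \
        --conn-password '' \
        --conn-schema analytics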
Use ClickHouseOperator to run SQL inline or from a file, or call ClickHouseHook inside a Python task for dynamic queries.
The DAG below sums daily Orders.total_amount and inserts the result into a reporting table.
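Here is a minimal sketch of such a DAG, assuming the airflow-clickhouse-plugin import path shown, an Orders.order_date column, and a reporting.daily_order_totals target table (all names are illustrative):

    # dags/daily_order_totals.py -- illustrative sketch; adjust the import path,
    # table names, and columns to your environment.
    from datetime import datetime

    from airflow import DAG
    from airflow_clickhouse_plugin.operators.clickhouse import ClickHouseOperator  # assumed path

    with DAG(
        dag_id="daily_order_totals",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",  # use schedule_interval on Airflow < 2.4
        catchup=False,
    ) as dag:
        aggregate_orders = ClickHouseOperator(
            task_id="aggregate_orders",
            clickhouse_conn_id="clickhouse_default",
            # {{ ds }} renders to the run's logical date, so each run
            # aggregates exactly one day of orders.
            sql="""
                INSERT INTO reporting.daily_order_totals (day, total_amount)
                SELECT toDate('{{ ds }}') AS day, sum(total_amount)
                FROM Orders
                WHERE order_date = toDate('{{ ds }}')
            """,
        )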
Use Jinja templating with {{ ds }} or {{ params }} to make the SQL date-aware and configurable. The operator's parameters argument binds values to named placeholders to avoid SQL injection.
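As an illustration, the sketch below binds a threshold through parameters; the %(name)s placeholder style assumes a clickhouse-driver-based client underneath, so check your plugin's documentation for the exact syntax.

    # Continues the DAG sketch above; the value is bound by the client,
    # not interpolated into the SQL string.
    from airflow_clickhouse_plugin.operators.clickhouse import ClickHouseOperator  # assumed path

    count_large_orders = ClickHouseOperator(
        task_id="count_large_orders",
        clickhouse_conn_id="clickhouse_default",
        sql="""
            SELECT count()
            FROM Orders
            WHERE order_date = toDate('{{ ds }}')
              AND total_amount >= %(min_amount)s
        """,
        parameters={"min_amount": 100},
    )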
After an S3 sensor, chain an S3ToClickHouseOperator (if your deployment provides one) or a Python task that uses clickhouse_connect. Airflow handles the dependencies and retries.
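A sketch of that pattern using the Amazon provider's S3KeySensor and a TaskFlow task that loads the staged file with clickhouse_connect; the bucket, key layout, table name, and use of ClickHouse's s3() table function are all assumptions.

    from datetime import datetime

    import clickhouse_connect
    from airflow import DAG
    from airflow.decorators import task
    from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor

    with DAG("s3_to_clickhouse", start_date=datetime(2024, 1, 1),
             schedule="@daily", catchup=False):

        wait_for_export = S3KeySensor(
            task_id="wait_for_export",
            bucket_name="my-bucket",                   # placeholder bucket
            bucket_key="exports/orders/{{ ds }}.csv",  # templated key per run
        )

        @task
        def load_into_clickhouse(ds=None):
            # Connection details are placeholders; ClickHouse pulls the file
            # itself via the s3() table function, so nothing streams through
            # the Airflow worker.
            client = clickhouse_connect.get_client(
                host="clickhouse-server", username="default", password=""
            )
            client.command(
                "INSERT INTO staging.orders SELECT * FROM s3("
                f"'https://my-bucket.s3.amazonaws.com/exports/orders/{ds}.csv', "
                "'CSVWithNames')"
            )

        wait_for_export >> load_into_clickhouse()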
Spin up Airflow and ClickHouse via Docker Compose. Export AIRFLOW_CONN_CLICKHOUSE_DEFAULT for quick CLI tests using airflow tasks test.
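For example (the URI values are placeholders, the clickhouse:// scheme must match the connection type your plugin registers, and the DAG and task ids refer to the sketch above):

    # Define the connection via an environment variable instead of the metadata DB.
    export AIRFLOW_CONN_CLICKHOUSE_DEFAULT='clickhouse://default:@clickhouse-server:9000/analytics'

    # Run one task for one logical date, without the scheduler.
    airflow tasks test daily_order_totals aggregate_orders 2024-01-01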
• Set retries and retry_delay to handle transient ClickHouse errors (see the sketch after this list).
• Batch large inserts with INSERT ... SELECT instead of row-by-row inserts.
• Store SQL in repo files for version control.
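A sketch of the first two bullets, with illustrative retry values and a single set-based statement (table names and the import path are assumptions, as above):

    from datetime import timedelta

    from airflow_clickhouse_plugin.operators.clickhouse import ClickHouseOperator  # assumed path

    backfill_totals = ClickHouseOperator(
        task_id="backfill_totals",
        clickhouse_conn_id="clickhouse_default",
        retries=3,                          # retry transient ClickHouse errors
        retry_delay=timedelta(minutes=5),   # wait between attempts
        # One set-based INSERT ... SELECT instead of many single-row inserts.
        sql="""
            INSERT INTO reporting.daily_order_totals (day, total_amount)
            SELECT order_date AS day, sum(total_amount)
            FROM Orders
            GROUP BY order_date
        """,
    )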
Any Airflow version ≥ 2.3 works. Ensure the ClickHouse plugin version matches your Airflow core version to avoid dependency conflicts.
To run multiple statements, wrap them in a single SQL string separated by semicolons, or call several ClickHouseOperator tasks in sequence for clearer logs.
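For instance, two single-statement tasks chained in sequence keep one statement per task log (table names and the import path are illustrative, as above):

    from airflow_clickhouse_plugin.operators.clickhouse import ClickHouseOperator  # assumed path

    create_staging = ClickHouseOperator(
        task_id="create_staging",
        clickhouse_conn_id="clickhouse_default",
        sql="CREATE TABLE IF NOT EXISTS staging.orders_tmp AS staging.orders",
    )
    swap_tables = ClickHouseOperator(
        task_id="swap_tables",
        clickhouse_conn_id="clickhouse_default",
        sql="EXCHANGE TABLES staging.orders_tmp AND staging.orders",
    )
    create_staging >> swap_tables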
For dynamic queries, use ClickHouseHook: call .get_conn().query(sql) or .get_pandas_df(sql) inside a PythonOperator task.
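A sketch of that pattern; the hook import path is an assumption, and the .get_conn().query(sql) call mirrors the usage described above (some driver clients expose .execute() instead):

    from airflow.operators.python import PythonOperator
    from airflow_clickhouse_plugin.hooks.clickhouse import ClickHouseHook  # assumed path

    def top_customers(ds, **_):
        hook = ClickHouseHook(clickhouse_conn_id="clickhouse_default")
        client = hook.get_conn()
        # ds is Airflow's logical date string, e.g. '2024-01-01'
        return client.query(
            "SELECT customer_id, sum(total_amount) AS total "
            "FROM Orders "
            f"WHERE order_date = toDate('{ds}') "
            "GROUP BY customer_id ORDER BY total DESC LIMIT 10"
        )

    top_customers_task = PythonOperator(
        task_id="top_customers",
        python_callable=top_customers,
    )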