How to Back Up a ClickHouse Cluster

Galaxy Glossary

How do I back up a ClickHouse cluster?

Backing up a ClickHouse cluster means creating consistent copies of its data and metadata so you can restore the system after failure, corruption, or accidental deletion.


Description

ClickHouse is famous for its blazing-fast analytics, but speed is useless if you cannot recover from disaster. A robust backup strategy ensures that terabytes—or even petabytes—of production data can be restored with minimal downtime. This article walks through the principles, tools, and real-world workflows for backing up ClickHouse clusters in 2024.

Understanding ClickHouse Storage Architecture

Before you can protect ClickHouse data, you must understand how it is stored:

  • ReplicatedMergeTree engines replicate partitions across shards and replicas using ZooKeeper or ClickHouse Keeper.
  • Parts are immutable data chunks stored as sets of files inside /var/lib/clickhouse/data/<db>/<table>/.
  • Metadata lives in /var/lib/clickhouse/metadata/ and ZooKeeper paths.
  • System tables (e.g., system.macros, system.clusters) hold cluster configuration.

Because parts are immutable, point-in-time consistency is easier than in row-mutable databases, but you still need to capture both data and metadata atomically.
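
For example, you can inspect what needs protecting directly from SQL. The queries below use the system tables mentioned above; the analytics.events table name is a placeholder for one of your own tables.

-- List the largest active parts for one table (analytics.events is a placeholder name)
SELECT database, table, name AS part_name, rows, bytes_on_disk
FROM system.parts
WHERE database = 'analytics' AND table = 'events' AND active
ORDER BY bytes_on_disk DESC
LIMIT 10;

-- Capture the cluster configuration that must be backed up alongside the data
SELECT * FROM system.macros;
SELECT cluster, shard_num, replica_num, host_name FROM system.clusters;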

Backup Options at a Glance

1. File-System Snapshots

Using LVM, ZFS, EBS, or Ceph snapshots provides instant, crash-consistent copies. Combine snapshots with object-storage uploads for durability.

2. clickhouse-backup Utility

The de facto community tool (Altinity/clickhouse-backup) automates data freezing, compression, and uploads to S3, GCS, Azure Blob, or NFS.

3. Built-in SQL Commands

ClickHouse 22.6+ offers native BACKUP and RESTORE statements, especially convenient for Managed ClickHouse services.

4. Managed-Service Snapshots

Cloud providers such as Altinity.Cloud, Aiven, and ClickHouse Cloud schedule automated snapshots and expose one-click restores.

Choosing a Strategy

Your production requirements dictate the mix of techniques:

  • Recovery Time Objective (RTO): How long can analytics be down? Snapshots offer the fastest restores; compressed S3 archives take longer to pull back and unpack.
  • Recovery Point Objective (RPO): How much data can you afford to lose? Continuous BACKUP increments or frequent snapshots shrink the window.
  • Storage Cost: Object storage is cheap but slower; block-level snapshots are fast but pricier.
  • Operational Complexity: Built-in SQL commands reduce moving pieces; community tools give more knobs.

Step-by-Step Guide with clickhouse-backup

1. Install the Tool

sudo wget -O /usr/local/bin/clickhouse-backup \
https://github.com/Altinity/clickhouse-backup/releases/download/v2.5.1/clickhouse-backup-linux-amd64
sudo chmod +x /usr/local/bin/clickhouse-backup

2. Configure

Edit /etc/clickhouse-backup/config.yml:

general:
  remote_storage: s3
  compression_format: tar
  upload_concurrency: 4

s3:
  bucket: clickhouse-prod-backups
  endpoint: s3.us-east-1.amazonaws.com
  access_key_id: <AWS_KEY>
  secret_access_key: <AWS_SECRET>
  path: /cluster1/{hostname}

3. Freeze Parts

clickhouse-backup freeze --tables "db.*"

The command issues ALTER TABLE … FREEZE to create hard-linked copies of each table's parts under /var/lib/clickhouse/shadow/.
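
If you prefer to freeze manually, the equivalent SQL looks roughly like this; analytics.events and the partition value are placeholders that depend on your schema and partition key.

-- Freeze every part of a table under a named shadow directory
ALTER TABLE analytics.events FREEZE WITH NAME 'pre_backup_2024_04_30';

-- Or freeze a single partition to limit the amount of hard-linked data
ALTER TABLE analytics.events FREEZE PARTITION 202404 WITH NAME 'pre_backup_2024_04_30';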

4. Create and Upload

clickhouse-backup create --backup-name 2024-04-30-full
clickhouse-backup upload 2024-04-30-full

This streams compressed archives to S3 and registers metadata.

5. Automate with Cron or Systemd

0 3 * * * /usr/local/bin/clickhouse-backup create_remote \
--tables "db.*" --full --retention 14d >/var/log/ch-backup.log 2>&1

A daily 03:00 UTC full backup kept for 14 days.

6. Restore Workflow

# Download chosen backup
time clickhouse-backup download 2024-04-30-full
# Restore data and metadata across the cluster
clickhouse-backup restore --rm 2024-04-30-full

The --rm flag drops existing tables before restoring. When restoring replicated tables, run SYSTEM SYNC REPLICA afterward so every replica catches up, as shown below.
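
A minimal post-restore check, assuming a replicated table named analytics.events, might look like this:

-- Wait for the restored replica to process its replication queue
SYSTEM SYNC REPLICA analytics.events;

-- Confirm no replica is read-only or lagging after the restore
SELECT database, table, is_readonly, absolute_delay
FROM system.replicas
WHERE database = 'analytics';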

Using Built-in BACKUP and RESTORE Commands

Create a Backup

BACKUP DATABASE analytics
TO S3('https://s3.us-east-1.amazonaws.com/clickhouse-prod-backups/2024-04-30', '<AWS_KEY>', '<AWS_SECRET>');

Incremental Backups

ClickHouse stores backup manifests; unchanged parts are referenced rather than copied, reducing I/O.
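
A sketch of an incremental backup chained to the earlier full backup via the base_backup setting; the bucket, path, and credential placeholders follow the examples above.

-- Incremental backup: parts already present in the base backup are referenced, not re-copied
BACKUP DATABASE analytics
TO S3('https://s3.us-east-1.amazonaws.com/clickhouse-prod-backups/2024-05-01-incr', '<AWS_KEY>', '<AWS_SECRET>')
SETTINGS base_backup = S3('https://s3.us-east-1.amazonaws.com/clickhouse-prod-backups/2024-04-30', '<AWS_KEY>', '<AWS_SECRET>');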

Restore

RESTORE ALL FROM S3('https://s3.us-east-1.amazonaws.com/clickhouse-prod-backups/2024-04-30', '<AWS_KEY>', '<AWS_SECRET>')
SETTINGS allow_non_empty_tables = 1;

Best Practices

  • Tag backups with Git SHA or deploy version to align data with application releases.
  • Include ZooKeeper snapshots (snapshot --force <path>) so replicated tables can rebuild peers.
  • Encrypt at rest using server-side or client-side envelope encryption.
  • Test restores monthly; scripted restores are the only proof a backup works.
  • Retain WAL-level or table-level logs for forensic analysis even if they are excluded from restores.
  • Monitor backup jobs via Prometheus metrics exposed by clickhouse-backup server.

Common Misconceptions

"Replication = Backup"

Replication guards against node failure, not operator error. If a DROP TABLE is executed, every replica deletes the data.

"Snapshots are Always Consistent"

File-system snapshots give crash consistency but can miss data that has not yet been flushed to disk. Run SYSTEM FLUSH LOGS or ALTER TABLE … FREEZE first, as sketched below.
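
A minimal pre-snapshot sequence, assuming a hypothetical analytics.events table; pausing merges is optional but keeps the part set stable while the snapshot is taken.

SYSTEM STOP MERGES analytics.events;   -- optionally pause background merges
SYSTEM FLUSH LOGS;                     -- flush system log tables to disk
ALTER TABLE analytics.events FREEZE WITH NAME 'pre_snapshot';
-- take the LVM/ZFS/EBS snapshot here, then resume merges
SYSTEM START MERGES analytics.events;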

"I Can Restore Any Node"

To restore a replica of a replicated table, you must clean up its ZooKeeper paths or use SYSTEM RESTORE REPLICA so the replica's identity matches the coordination metadata; see the sketch below.
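
For example, if a replica still has its local parts but lost its coordination metadata, the recovery looks roughly like this (the table name is a placeholder):

-- Recreate ZooKeeper/Keeper metadata for a replica whose coordination data was lost,
-- then let it catch up with the other replicas
SYSTEM RESTORE REPLICA analytics.events;
SYSTEM SYNC REPLICA analytics.events;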

Real-World Case Study

FinTech Co ingests 50 B rows per day across 6 shards × 2 replicas. They:

  1. Take hourly incremental backups with BACKUP to S3.
  2. Mirror S3 to Azure via s3-sync for geo-redundancy.
  3. Test a full restore in a staging VPC every Sunday.
  4. Keep 30 days local, 180 days remote, and delete older archives using clickhouse-backup delete remote.

When an engineer accidentally truncated a partition, they restored the affected shard in 7 minutes with zero data loss.

Integrating Backups into CI/CD

Embed ClickHouse restores in integration tests to validate migrations:

docker run --name ch-test -d clickhouse/clickhouse-server:23.8
clickhouse-backup restore --replication --rm $CI_BACKUP_ID
pytest tests/sql_migrations_test.py

Catch schema drift early and prove your backup can seed ephemeral environments.

Monitoring & Alerting

  • Fire a PagerDuty alert when ClickHouseBackupLastSuccessTimestamp exceeds 24 hours.
  • Compare system.disks.free_space against the expected backup size to avoid archive failures (see the queries below).
  • Ship /var/log/ch-backup.log to Loki or ELK for auditing.
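
Two illustrative checks, assuming a recent ClickHouse version that exposes the system.backups table for native BACKUP jobs:

-- Status of recent native BACKUP/RESTORE operations
SELECT id, name, status, error, start_time, end_time
FROM system.backups
ORDER BY start_time DESC
LIMIT 10;

-- Free space per disk versus the expected archive size
SELECT name, formatReadableSize(free_space) AS free, formatReadableSize(total_space) AS total
FROM system.disks;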

Conclusion

Backing up a ClickHouse cluster is straightforward when you leverage immutable parts, but the devil is in the details—coordination with ZooKeeper, retention policies, encryption, and continuous testing. Whether you choose file-system snapshots, clickhouse-backup, or native SQL commands, automate everything and rehearse restores regularly.

Why How to Back Up a ClickHouse Cluster is important

ClickHouse often serves as the single source of truth for petabyte-scale analytics. A failed node, bad deployment, or accidental DROP can wipe out critical insight and revenue. Solid backup strategy safeguards data integrity, minimizes downtime, meets compliance standards, and instills confidence in engineering teams that fast analytics won’t come at the cost of resiliency.

How to Back Up a ClickHouse Cluster Example Usage


BACKUP DATABASE analytics TO S3('https://ch-backups.s3.amazonaws.com/backup-2024-04-30', '<AWS_KEY>', '<AWS_SECRET>');


Frequently Asked Questions (FAQs)

How long does a typical ClickHouse backup take?

On SSD-backed nodes, a 1 TB dataset compresses and uploads to S3 in roughly 15–25 minutes using four upload threads. Incremental backups after the first full run are dramatically faster because unchanged parts are skipped.

Can I query data while a backup is running?

Yes. ClickHouse parts are immutable, so read and write workloads continue unhindered. The ALTER ... FREEZE step merely creates hard links, introducing negligible I/O overhead.

Do I need to stop ZooKeeper during backup?

No. Both clickhouse-backup and native BACKUP capture ZooKeeper metadata automatically or via an additional snapshot, allowing the service to stay online.

What is the best retention policy?

A common pattern is: keep 7 daily, 4 weekly, 6 monthly, and 12 yearly backups. Adjust based on compliance, cost, and data-change rate.
