Troubleshooting “disk full” errors in ClickHouse

Galaxy Glossary

How do I troubleshoot “disk full” errors in ClickHouse?

A systematic approach to diagnosing and resolving ClickHouse disk-full errors by identifying space consumers, freeing space safely, and preventing recurrence.



Overview

Few incidents bring a ClickHouse cluster to its knees faster than a "disk full" error. Because ClickHouse is a column-oriented database that writes immutable parts to disk and relies on background merges, the server must have free space for write-ahead logs, merges, and new parts. When the underlying storage device is 100 % utilized, INSERTs begin to fail, merges stop, and read queries may degrade as housekeeping threads continually retry.

Why disk-full conditions happen in ClickHouse

  • Rapid data growth — unexpected ingestion spikes or poor partitioning can overwhelm planned capacity.
  • Long-running merges & mutations — to merge two parts ClickHouse creates a third, temporarily doubling required space.
  • Large DELETE or UPDATE mutations — until mutations finish, old parts stay on disk.
  • Misconfigured TTLs — data expected to age out is retained indefinitely (a quick check is sketched after this list).
  • Log and system-table growth — system.trace_log, system.part_log, and other system tables grow without bound unless rotated or truncated.
  • Non-ClickHouse files — backups, core dumps, and application artifacts live on the same mount.
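
If you suspect missing retention policies, one rough heuristic is to scan table definitions for a TTL clause. This sketch simply greps create_table_query, so tables with intentional unlimited retention will also appear; treat the output as candidates, not verdicts:

-- MergeTree tables whose definition contains no TTL clause
SELECT database, name, engine
FROM system.tables
WHERE engine LIKE '%MergeTree%'
  AND database NOT IN ('system')
  AND create_table_query NOT ILIKE '%TTL%';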

Symptoms you will see

  • INSERT errors: DB::Exception: Not enough disk space
  • Background merges halted: "Cannot reserve" messages in system.text_log (see the query after this list)
  • df -h shows 100 % usage on the data path
  • Query latency spikes as merges queue up
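
If logging to system.text_log is enabled on your server, you can confirm the reservation failures directly. A sketch, with the time window adjusted to taste:

-- Recent reservation failures reported by merge and insert threads
SELECT event_time, level, logger_name, message
FROM system.text_log
WHERE message LIKE '%Cannot reserve%'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC
LIMIT 20;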

A step-by-step troubleshooting workflow

The quickest way to get your cluster healthy is to follow a structured checklist.

1. Identify every storage path ClickHouse uses

SELECT
    name,
    path,
    free_space,
    total_space,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total
FROM system.disks;

Confirm the specific mount (often /var/lib/clickhouse/) that is full. If multiple disks are configured, ensure the correct one is targeted for cleanup.

2. Quantify which tables and parts consume space

SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;

The output quickly shows the worst offenders. For replicated tables, run the query on each replica; disk usage can differ because of merge lag or mutations.
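
Inactive (outdated) parts left behind by merges and mutations also hold space until background cleanup removes them. A companion query makes that visible:

-- Space still held by inactive parts awaiting cleanup
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS inactive_size,
    count() AS inactive_parts
FROM system.parts
WHERE NOT active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;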

3. Check for unfinished merges, mutations and TTL moves

-- Pending mutations
SELECT database, table, mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0;

-- Merge/fetch queue backlog (replicated tables; merges on non-replicated tables appear in system.merges)
SELECT *
FROM system.replication_queue
ORDER BY create_time DESC
LIMIT 20;

If hundreds of parts are waiting to merge, your disk may be double-booked. Throttle ingestion while merges catch up, and on replicated tables consider enabling the MergeTree setting execute_merges_on_single_replica_time_threshold so each merge runs on one replica and the others fetch the finished part instead of re-merging it.

4. Purge safely using built-in mechanisms

  • Optimize at partition level to complete merges quickly and drop intermediate parts:
    OPTIMIZE TABLE db.tbl PARTITION '2024-05-23' FINAL;
  • Drop partitions you no longer need:
    ALTER TABLE db.tbl DROP PARTITION '2023-12'; (Instant metadata removal; actual parts are deleted asynchronously.)
  • FREEZE before DROP if you need a backup (ALTER TABLE FREEZE PARTITION).
  • SYSTEM FLUSH LOGS then truncate verbose system logs if they keep months of history.
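
Putting the commands above together, a typical cleanup pass might look like the following sketch. The table, partition ID, and chosen system log tables are placeholders; adapt them to your schema and retention rules:

-- 1. Snapshot the partition locally before removing it
ALTER TABLE db.tbl FREEZE PARTITION '2023-12';

-- 2. Drop it (metadata change; part directories are deleted asynchronously)
ALTER TABLE db.tbl DROP PARTITION '2023-12';

-- 3. Flush in-memory log buffers, then truncate oversized system log tables
SYSTEM FLUSH LOGS;
TRUNCATE TABLE system.trace_log;
TRUNCATE TABLE system.query_log;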

5. Emergency: manual deletion of obsolete mutations

As a last resort—only if ClickHouse cannot start—you can delete outdated directories under store/ that match mutation IDs already marked done. Perform backups first and follow official docs; accidental removal corrupts the replica.
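
Before touching anything under store/, map the on-disk directories back to their tables and parts so you only remove what the metadata already considers obsolete. In this sketch, 'db' and 'tbl' are placeholders:

-- Resolve part directories to tables and check whether they are still active
SELECT database, table, name AS part_name, active, path, modification_time
FROM system.parts
WHERE database = 'db' AND table = 'tbl'
ORDER BY modification_time;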

6. Restart merges

SYSTEM START MERGES;

Watch system.merges and system.events to ensure throughput recovers.
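
A quick way to watch recovery is to poll in-flight merges; progress should climb and the list should shrink as the backlog drains:

-- Currently executing merges, largest elapsed time first
SELECT
    database,
    table,
    elapsed,
    round(progress, 3) AS progress,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) AS merging
FROM system.merges
ORDER BY elapsed DESC;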

Long-term prevention and monitoring

Set aggressive data TTLs

ALTER TABLE logs
MODIFY TTL timestamp + INTERVAL 90 DAY DELETE;

Always combine DELETE TTL with MOVE TO VOLUME 'slow' or object storage tiers to free SSD space earlier.
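
For example, a combined tiering-plus-delete policy might look like the sketch below. It assumes your storage policy defines a volume named 'slow'; adjust the column name and intervals to your schema:

-- Move data to the slow tier after 30 days, delete it after 90
ALTER TABLE logs
MODIFY TTL timestamp + INTERVAL 30 DAY TO VOLUME 'slow',
           timestamp + INTERVAL 90 DAY DELETE;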

Reserve space for merges

  • Set keep_free_space_bytes on each disk in storage_configuration so ClickHouse always leaves a reserve of N bytes for merges and emergencies.

Separate WAL & data

Put tmp/ and metadata/ on faster local disks, while historical data goes to cheaper drives.

Alerting queries

SELECT
    now() AS ts,
    name,
    free_space,
    total_space,
    round(free_space / total_space, 3) AS pct_free
FROM system.disks
WHERE free_space / total_space < 0.1;

Schedule in your observability stack (Grafana, Prometheus, or run interactively via Galaxy).

Galaxy & troubleshooting workflow

Because disk-full triage often happens under time pressure, a fast SQL editor accelerates root-cause analysis. With Galaxy you can:

  • Run the diagnostic queries above against production replicas without SSHing into the host.
  • Use the AI Copilot to explain unusual results, e.g., "Why does system.mutations show 10 k pending mutations for events?"
  • Save proven cleanup scripts in a Collection labelled “Incident Run-books” so on-call engineers reuse endorsed SQL instead of pasting commands into Slack.
  • Apply least-privilege access controls so only SREs can execute DROP or TRUNCATE commands.

Best Practices checklist

  • Provision at least 30–50% headroom beyond daily peak ingest so merges never starve.
  • Use TTL DELETE and TTL MOVE on every high-volume table.
  • Keep WAL (tmp/) on a disk with separate capacity planning.
  • Monitor the DiskSpaceReservedForMerge metric and free space in system.disks; alert below 10% free (a sample query follows this list).
  • Prefer OPTIMIZE ... FINAL to force large merges during maintenance windows.
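
To check the merge reservation from SQL (the metric name below reflects current releases and may differ in older versions):

-- Bytes currently reserved for in-flight merges
SELECT metric, value, description
FROM system.metrics
WHERE metric = 'DiskSpaceReservedForMerge';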

Common misconceptions

Engineers sometimes rely on practices that work for row-store databases but backfire in ClickHouse:

  1. “I can delete files directly from the filesystem.”
    Without updating system metadata, ClickHouse expects the parts to be present and will refuse to start or replicate. Always use ALTER TABLE ... DROP.
  2. “Adding another replica will give me more space.”
    Replicas store identical copies, so total cluster usage doubles. You need sharding or tiered storage, not replication.
  3. “Truncating a table frees disk instantly.”
    TRUNCATE is metadata-only; background cleanup threads delete parts asynchronously. Disk may remain full for minutes or hours.

Conclusion

Disk-full incidents in ClickHouse are disruptive but avoidable. With a rigorous checklist, proactive TTL policies, and real-time monitoring—augmented by the speed and collaboration features of Galaxy—you can resolve emergencies quickly and keep clusters healthy.

Why Troubleshooting “disk full” errors in ClickHouse is important

A disk-full condition halts INSERTs, blocks merges, and can cascade into service outages. Understanding the ClickHouse storage engine, its merge behavior, and safe cleanup procedures is essential for on-call engineers tasked with maintaining uptime in analytics pipelines that ingest terabytes daily.

Troubleshooting “disk full” errors in ClickHouse Example Usage


SELECT database, table, formatReadableSize(sum(bytes_on_disk)) AS size FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC LIMIT 10;

Frequently Asked Questions (FAQs)

How can I quickly find which ClickHouse tables are consuming the most disk?

Run the query shown in the Example Usage section against system.parts. Ordering by sum(bytes_on_disk) surfaces the largest tables instantly.

Is it safe to remove the /var/lib/clickhouse/tmp/ directory during an outage?

Only if ClickHouse is stopped and you are certain no merges or queries are in progress. The safer path is usually to restart the server: ClickHouse clears its own temporary files under tmp/ during startup.

How can Galaxy help troubleshoot disk full errors in ClickHouse?

Galaxy’s desktop SQL editor lets you execute the diagnostic queries in this article, view results side-by-side, and store them in a shared Collection. The AI Copilot can suggest optimizations and summarize findings for incident reports, accelerating mean time to resolution (MTTR).

Will adding more replicas solve disk space issues?

No. Replication duplicates data for redundancy. To gain space you need sharding, tiered storage (e.g., S3 volumes), or data retention policies.
