A systematic approach to diagnosing and resolving ClickHouse disk-full errors by identifying space consumers, freeing space safely, and preventing recurrence.
Few incidents bring a ClickHouse cluster to its knees faster than a "disk full" error. Because ClickHouse is a column-oriented database that writes immutable parts to disk and relies on background merges, the server must have free space for write-ahead logs, merges, and new parts. When the underlying storage device is 100 % utilized, INSERTs begin to fail, merges stop, and read queries may degrade as housekeeping threads continually retry.
The classic symptoms: DB::Exception: Not enough disk space or Cannot reserve messages in system.text_log and the server log, and df -h showing 100% usage on the data path. The quickest way to get your cluster healthy is to follow a structured checklist.
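As a quick triage aid, the log signatures above can be matched programmatically. A minimal sketch; the sample log lines are hypothetical:

```python
import re

# Classic disk-full signatures from the ClickHouse server log / system.text_log.
DISK_FULL = re.compile(r"Not enough disk space|Cannot reserve")

def disk_full_lines(log_lines):
    """Return the log lines matching the disk-full signatures."""
    return [line for line in log_lines if DISK_FULL.search(line)]

# Hypothetical sample log lines for illustration:
sample = [
    "2024.05.23 12:00:01 <Error> DB::Exception: Not enough disk space",
    "2024.05.23 12:00:02 <Trace> MergeTask: merged 3 parts",
    "2024.05.23 12:00:03 <Error> Cannot reserve 1.00 MiB, not enough space",
]
print(len(disk_full_lines(sample)))  # -> 2
```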
SELECT name, path, free_space, total_space,
formatReadableSize(free_space) AS free,
formatReadableSize(total_space) AS total
FROM system.disks;
Confirm the specific mount (often /var/lib/clickhouse/) that is full. If multiple disks are configured, make sure the correct one is targeted for cleanup.
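To cross-check ClickHouse's numbers against the operating system's view, a small sketch using Python's standard library (the data path is an assumption; "/" is used here so the snippet runs anywhere):

```python
import shutil

def pct_free(path: str) -> float:
    """Fraction of free space on the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

# On a ClickHouse node you would pass the data path, e.g.
# pct_free("/var/lib/clickhouse"); "/" keeps the example portable.
print(f"{pct_free('/'):.1%} free")
```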
SELECT database,
table,
formatReadableSize(sum(bytes_on_disk)) AS size,
count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;
The output quickly shows the worst offenders. For replicated tables, run the query on each replica; disk usage can differ because of merge lag or mutations.
-- Pending mutations
SELECT database, table, mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0;
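If a pending mutation is stuck and its outcome is no longer needed, it can be cancelled so that its remaining work (and the temporary space it would consume) is released. A hedged sketch; db.tbl and the mutation ID are placeholders:

```sql
-- Cancel a specific stuck mutation (placeholders: db.tbl, mutation_123).
KILL MUTATION WHERE database = 'db' AND table = 'tbl' AND mutation_id = 'mutation_123';
```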
-- Merge queue backlog (replicated tables)
SELECT *
FROM system.replication_queue
WHERE type = 'MERGE_PARTS'
ORDER BY create_time DESC
LIMIT 20;
If hundreds of parts are waiting to merge, your disk may be double-booked. Consider temporarily setting execute_merges_on_single_replica = 1 (exposed as allow_execute_merges_on_single_replica in newer versions), or throttle ingestion while merges catch up.
To reclaim space, work through the options from safest to most drastic. Force a merge of a specific partition so obsolete parts can be cleaned up (note that a forced merge itself needs free space for the new part):
OPTIMIZE TABLE db.tbl PARTITION '2024-05-23' FINAL;
Drop partitions whose data you no longer need:
ALTER TABLE db.tbl DROP PARTITION '2023-12';
(Instant metadata removal; actual parts are deleted asynchronously.) If the data may still be needed, snapshot it first (ALTER TABLE ... FREEZE PARTITION). As a last resort, and only if ClickHouse cannot start, you can delete outdated directories under store/ that match mutation IDs already marked done. Perform backups first and follow the official docs; accidental removal corrupts the replica.
Once space has been reclaimed, re-enable merges if you stopped them:
SYSTEM START MERGES;
Watch system.merges and system.events to ensure throughput recovers.
ALTER TABLE logs
MODIFY TTL timestamp + INTERVAL 90 DAY DELETE;
Always combine DELETE TTL with MOVE TO VOLUME 'slow'
or object storage tiers to free SSD space earlier.
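The two TTL actions can be combined in a single expression, moving data to a slower volume well before it is eventually deleted. A sketch assuming a storage policy that defines a volume named 'slow':

```sql
ALTER TABLE logs MODIFY TTL
    timestamp + INTERVAL 30 DAY TO VOLUME 'slow',
    timestamp + INTERVAL 90 DAY DELETE;
```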
Configure a reservation under storage_configuration.disks (storage_configuration.disks.reserve_space, ClickHouse >= 22.8) to keep N GB free for emergencies. Put tmp/ and metadata/ on faster local disks, while historical data goes to cheaper drives.
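As a sketch of such a reservation, the per-disk keep_free_space_bytes setting tells ClickHouse to refuse new part allocations once the reserve is reached. The 10 GiB value and disk name are illustrative; verify the setting name against your version's documentation:

```xml
<clickhouse>
    <storage_configuration>
        <disks>
            <default>
                <!-- Stop allocating new parts once less than 10 GiB remains -->
                <keep_free_space_bytes>10737418240</keep_free_space_bytes>
            </default>
        </disks>
    </storage_configuration>
</clickhouse>
```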
SELECT
now() AS ts,
name,
free_space,
total_space,
round(free_space / total_space, 3) AS pct_free
FROM system.disks
WHERE free_space / total_space < 0.1;
Schedule in your observability stack (Grafana, Prometheus, or run interactively via Galaxy).
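The alerting logic behind that query can also be sketched in a few lines, useful if you feed system.disks rows into a script rather than a dashboard. The sample rows are hypothetical:

```python
def disks_low_on_space(rows, threshold=0.10):
    """Given (name, free_bytes, total_bytes) rows, e.g. from system.disks,
    return the names of disks whose free fraction is below `threshold`."""
    return [name for name, free, total in rows
            if total > 0 and free / total < threshold]

# Hypothetical rows resembling system.disks output:
rows = [
    ("default", 5 * 2**30, 100 * 2**30),    # 5% free -> should alert
    ("cold", 400 * 2**30, 1000 * 2**30),    # 40% free -> fine
]
print(disks_low_on_space(rows))  # -> ['default']
```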
Because disk-full triage often happens under time pressure, a fast SQL editor accelerates root-cause analysis. With Galaxy you can run the diagnostic queries in this article and share them with teammates, for example when system.mutations shows 10k pending mutations for an events table.
A few preventive best practices:
- Define TTL DELETE and TTL MOVE on every high-volume table.
- Keep the temporary directory (tmp/) on a disk with separate capacity planning.
- Monitor the MergeTreeDataDiskSpaceReserved metric; alert below 10% free.
- Schedule OPTIMIZE ... FINAL to force large merges during maintenance windows.

Engineers sometimes rely on practices that work for row-store databases but backfire in ClickHouse:
- Expecting ALTER TABLE ... DROP or TRUNCATE to free space instantly. Both are metadata-only operations; background cleanup threads delete the parts asynchronously, so the disk may remain full for minutes or hours.

Disk-full incidents in ClickHouse are disruptive but avoidable. With a rigorous checklist, proactive TTL policies, and real-time monitoring, augmented by the speed and collaboration features of Galaxy, you can resolve emergencies quickly and keep clusters healthy.
A disk-full condition halts INSERTs, blocks merges, and can cascade into service outages. Understanding the ClickHouse storage engine, its merge behavior, and safe cleanup procedures is essential for on-call engineers tasked with maintaining uptime in analytics pipelines that ingest terabytes daily.
Run the query shown in the Example Query section against system.parts. Ordering by sum(bytes_on_disk) surfaces the largest tables instantly.
Can I safely delete files in the /var/lib/clickhouse/tmp/ directory during an outage?
Only if ClickHouse is stopped and you are certain no merges are in progress. The safer path is to start ClickHouse with --mark_cache_size=0 and let it clean temp files itself.
Galaxy’s desktop SQL editor lets you execute the diagnostic queries in this article, view results side-by-side, and store them in a shared Collection. The AI Copilot can suggest optimizations and summarize findings for incident reports, accelerating mean time to resolution (MTTR).
Does adding more replicas free up disk space?
No. Replication duplicates data for redundancy. To gain space you need sharding, tiered storage (e.g., S3 volumes), or data retention policies.