Troubleshooting “disk full” errors in ClickHouse

Galaxy Glossary

How do I troubleshoot “disk full” errors in ClickHouse?

A systematic approach to diagnosing and resolving ClickHouse disk-full errors by identifying space consumers, freeing space safely, and preventing recurrence.



Overview

Few incidents bring a ClickHouse cluster to its knees faster than a "disk full" error. Because ClickHouse is a column-oriented database that writes immutable parts to disk and relies on background merges, the server must have free space for write-ahead logs, merges, and new parts. When the underlying storage device is 100 % utilized, INSERTs begin to fail, merges stop, and read queries may degrade as housekeeping threads continually retry.

Why disk-full conditions happen in ClickHouse

  • Rapid data growth — unexpected ingestion spikes or poor partitioning can overwhelm planned capacity.
  • Long-running merges & mutations — to merge two parts ClickHouse creates a third, temporarily doubling required space.
  • Large DELETE or UPDATE mutations — until mutations finish, old parts stay on disk.
  • Misconfigured TTLs — data expected to age out is retained indefinitely (a quick check is sketched after this list).
  • Log and system-table growth — system.trace_log, system.part_log, and other system tables grow without bound unless rotated or truncated.
  • Non-ClickHouse files — backups, core dumps, and application artifacts live on the same mount.
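
If you suspect missing retention policies, one rough heuristic is to scan table definitions for a TTL clause. This sketch simply greps create_table_query, so tables with intentional unlimited retention will also appear; treat the output as candidates, not verdicts:

-- MergeTree tables whose definition contains no TTL clause
SELECT database, name, engine
FROM system.tables
WHERE engine LIKE '%MergeTree%'
  AND database NOT IN ('system')
  AND create_table_query NOT ILIKE '%TTL%';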

Symptoms you will see

  • INSERT errors: DB::Exception: Not enough disk space
  • Background merges halted: "Cannot reserve" messages in system.text_log (see the query after this list)
  • df -h shows 100 % usage on the data path
  • Query latency spikes as merges queue up
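
If logging to system.text_log is enabled on your server, you can confirm the reservation failures directly. A sketch, with the time window adjusted to taste:

-- Recent reservation failures reported by merge and insert threads
SELECT event_time, level, logger_name, message
FROM system.text_log
WHERE message LIKE '%Cannot reserve%'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY event_time DESC
LIMIT 20;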

A step-by-step troubleshooting workflow

The quickest way to get your cluster healthy is to follow a structured checklist.

1. Identify every storage path ClickHouse uses

SELECT
    name,
    path,
    free_space,
    total_space,
    formatReadableSize(free_space) AS free,
    formatReadableSize(total_space) AS total
FROM system.disks;

Confirm the specific mount (often /var/lib/clickhouse/) that is full. If multiple disks are configured, ensure the correct one is targeted for cleanup.

2. Quantify which tables and parts consume space

SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS size,
    count() AS active_parts
FROM system.parts
WHERE active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;

The output quickly shows the worst offenders. For replicated tables, run the query on each replica; disk usage can differ because of merge lag or mutations.
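
Inactive (outdated) parts left behind by merges and mutations also hold space until background cleanup removes them. A companion query makes that visible:

-- Space still held by inactive parts awaiting cleanup
SELECT
    database,
    table,
    formatReadableSize(sum(bytes_on_disk)) AS inactive_size,
    count() AS inactive_parts
FROM system.parts
WHERE NOT active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC
LIMIT 20;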

3. Check for unfinished merges, mutations and TTL moves

-- Pending mutations
SELECT database, table, mutation_id, command, parts_to_do, is_done
FROM system.mutations
WHERE is_done = 0;

-- Merge/fetch queue backlog (replicated tables; merges on non-replicated tables appear in system.merges)
SELECT *
FROM system.replication_queue
ORDER BY create_time DESC
LIMIT 20;

If hundreds of parts are waiting to merge, your disk may be double-booked. Throttle ingestion while merges catch up, and on replicated tables consider enabling the MergeTree setting execute_merges_on_single_replica_time_threshold so each merge runs on one replica and the others fetch the finished part instead of re-merging it.

4. Purge safely using built-in mechanisms

  • Optimize at partition level to complete merges quickly and drop intermediate parts:
    OPTIMIZE TABLE db.tbl PARTITION '2024-05-23' FINAL;
  • Drop partitions you no longer need:
    ALTER TABLE db.tbl DROP PARTITION '2023-12'; (Instant metadata removal; actual parts are deleted asynchronously.)
  • FREEZE before DROP if you need a backup (ALTER TABLE FREEZE PARTITION).
  • SYSTEM FLUSH LOGS then truncate verbose system logs if they keep months of history.
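
Putting the commands above together, a typical cleanup pass might look like the following sketch. The table, partition ID, and chosen system log tables are placeholders; adapt them to your schema and retention rules:

-- 1. Snapshot the partition locally before removing it
ALTER TABLE db.tbl FREEZE PARTITION '2023-12';

-- 2. Drop it (metadata change; part directories are deleted asynchronously)
ALTER TABLE db.tbl DROP PARTITION '2023-12';

-- 3. Flush in-memory log buffers, then truncate oversized system log tables
SYSTEM FLUSH LOGS;
TRUNCATE TABLE system.trace_log;
TRUNCATE TABLE system.query_log;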

5. Emergency: manual deletion of obsolete mutations

As a last resort—only if ClickHouse cannot start—you can delete outdated directories under store/ that match mutation IDs already marked done. Perform backups first and follow official docs; accidental removal corrupts the replica.
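
Before touching anything under store/, map the on-disk directories back to their tables and parts so you only remove what the metadata already considers obsolete. In this sketch, 'db' and 'tbl' are placeholders:

-- Resolve part directories to tables and check whether they are still active
SELECT database, table, name AS part_name, active, path, modification_time
FROM system.parts
WHERE database = 'db' AND table = 'tbl'
ORDER BY modification_time;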

6. Restart merges

SYSTEM START MERGES;

Watch system.merges and system.events to ensure throughput recovers.
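
A quick way to watch recovery is to poll in-flight merges; progress should climb and the list should shrink as the backlog drains:

-- Currently executing merges, largest elapsed time first
SELECT
    database,
    table,
    elapsed,
    round(progress, 3) AS progress,
    num_parts,
    formatReadableSize(total_size_bytes_compressed) AS merging
FROM system.merges
ORDER BY elapsed DESC;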

Long-term prevention and monitoring

Set aggressive data TTLs

ALTER TABLE logs
MODIFY TTL timestamp + INTERVAL 90 DAY DELETE;

Always combine DELETE TTL with MOVE TO VOLUME 'slow' or object storage tiers to free SSD space earlier.
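
For example, a combined tiering-plus-delete policy might look like the sketch below. It assumes your storage policy defines a volume named 'slow'; adjust the column name and intervals to your schema:

-- Move data to the slow tier after 30 days, delete it after 90
ALTER TABLE logs
MODIFY TTL timestamp + INTERVAL 30 DAY TO VOLUME 'slow',
           timestamp + INTERVAL 90 DAY DELETE;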

Reserve space for merges

  • Set keep_free_space_bytes on each disk in storage_configuration so ClickHouse always leaves a reserve of N bytes for merges and emergencies.

Separate WAL & data

Put tmp/ and metadata/ on faster local disks, while historical data goes to cheaper drives.

Alerting queries

SELECT
    now() AS ts,
    name,
    free_space,
    total_space,
    round(free_space / total_space, 3) AS pct_free
FROM system.disks
WHERE free_space / total_space < 0.1;

Schedule in your observability stack (Grafana, Prometheus, or run interactively via Galaxy).

Galaxy & troubleshooting workflow

Because disk-full triage often happens under time pressure, a fast SQL editor accelerates root-cause analysis. With Galaxy you can:

  • Run the diagnostic queries above against production replicas without SSHing into the host.
  • Use the AI Copilot to explain unusual results, e.g., "Why does system.mutations show 10 k pending mutations for events?"
  • Save proven cleanup scripts in a Collection labelled “Incident Run-books” so on-call engineers reuse endorsed SQL instead of pasting commands into Slack.
  • Apply least-privilege access controls so only SREs can execute DROP or TRUNCATE commands.

Best Practices checklist

  • Provision at least 30–50% headroom beyond daily peak ingest so merges never starve.
  • Use TTL DELETE and TTL MOVE on every high-volume table.
  • Keep WAL (tmp/) on a disk with separate capacity planning.
  • Monitor the DiskSpaceReservedForMerge metric and free space in system.disks; alert below 10% free (a sample query follows this list).
  • Prefer OPTIMIZE ... FINAL to force large merges during maintenance windows.
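
To check the merge reservation from SQL (the metric name below reflects current releases and may differ in older versions):

-- Bytes currently reserved for in-flight merges
SELECT metric, value, description
FROM system.metrics
WHERE metric = 'DiskSpaceReservedForMerge';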

Common misconceptions

Engineers sometimes rely on practices that work for row-store databases but backfire in ClickHouse:

  1. “I can delete files directly from the filesystem.”
    Without updating system metadata, ClickHouse expects the parts to be present and will refuse to start or replicate. Always use ALTER TABLE ... DROP.
  2. “Adding another replica will give me more space.”
    Replicas store identical copies, so total cluster usage doubles. You need sharding or tiered storage, not replication.
  3. “Truncating a table frees disk instantly.”
    TRUNCATE is metadata-only; background cleanup threads delete parts asynchronously. Disk may remain full for minutes or hours.

Conclusion

Disk-full incidents in ClickHouse are disruptive but avoidable. With a rigorous checklist, proactive TTL policies, and real-time monitoring—augmented by the speed and collaboration features of Galaxy—you can resolve emergencies quickly and keep clusters healthy.

Why Troubleshooting “disk full” errors in ClickHouse is important

A disk-full condition halts INSERTs, blocks merges, and can cascade into service outages. Understanding the ClickHouse storage engine, its merge behavior, and safe cleanup procedures is essential for on-call engineers tasked with maintaining uptime in analytics pipelines that ingest terabytes daily.

Troubleshooting “disk full” errors in ClickHouse Example Usage


SELECT database, table, formatReadableSize(sum(bytes_on_disk)) AS size FROM system.parts WHERE active GROUP BY database, table ORDER BY sum(bytes_on_disk) DESC LIMIT 10;

Frequently Asked Questions (FAQs)

How can I quickly find which ClickHouse tables are consuming the most disk?

Run the query shown in the Example Usage section against system.parts. Ordering by sum(bytes_on_disk) surfaces the largest tables instantly.

Is it safe to remove the /var/lib/clickhouse/tmp/ directory during an outage?

Only if ClickHouse is stopped and you are certain no merges or queries are in progress. The safer path is usually to restart the server: ClickHouse clears its own temporary files under tmp/ during startup.

How can Galaxy help troubleshoot disk full errors in ClickHouse?

Galaxy’s desktop SQL editor lets you execute the diagnostic queries in this article, view results side-by-side, and store them in a shared Collection. The AI Copilot can suggest optimizations and summarize findings for incident reports, accelerating mean time to resolution (MTTR).

Will adding more replicas solve disk space issues?

No. Replication duplicates data for redundancy. To gain space you need sharding, tiered storage (e.g., S3 volumes), or data retention policies.
