Why Too Many Tiny Parts Undermine ClickHouse Performance
#Infrastructure

Why Too Many Tiny Parts Undermine ClickHouse Performance

Backend Reporter
5 min read

ClickHouse stores data as immutable parts; when inserts generate parts faster than background merges can compact them, fragmentation explodes. This hidden buildup inflates metadata, merge load, query latency, and recovery time. Controlling batch size, partition granularity, and using async inserts are practical ways to keep part counts manageable, while the FINAL clause only masks the symptom.

Why Too Many Tiny Parts Undermine ClickHouse Performance

Featured image

ClickHouse’s MergeTree family stores every INSERT as an immutable part on disk. A single INSERT INTO events VALUES (…) creates a new part, and that is the expected, healthy behavior of the engine. The problem appears not when parts exist, but when they accumulate faster than the background merge process can compact them.


The hidden problem: part explosion

Symptom Root cause
Insert latency climbs Merge threads lag behind part creation
Queries scan more files More parts = more file descriptors and metadata reads
Memory usage spikes Merge scheduler holds many pending merges
Restart / replication slows Server must load metadata for thousands of parts

At first the cluster looks fine. Tiny micro‑batches or row‑by‑row inserts generate a steady stream of parts, but the merge subsystem keeps up. Over minutes or hours the parts‑per‑second rate overtakes the merges‑per‑second rate. Fragmentation grows silently, and every new query has to open and coordinate a larger set of files. The result is a classic case of operational latency that is not directly tied to data volume.


Why merges matter so much

Merges are the engine’s garbage collector:

  • Combine small parts into larger ones, reducing the number of files.
  • Re‑compress data, improving columnar compression ratios.
  • Consolidate primary‑key indexes, making range scans cheaper.

When merges run behind, each new part adds to the metadata burden (part headers, granule indexes, checksums). The scheduler must track more pending jobs, and the I/O subsystem sees a higher churn of open/close operations. The cost is linear in the number of parts, not in the amount of actual data.


Practical mitigation strategies

1. Batch inserts appropriately

Large batches produce fewer parts. A rule of thumb is to aim for 10 MiB–100 MiB per part (the exact sweet spot depends on your compression codec and query patterns). This reduces merge pressure and improves compression.

2. Avoid over‑partitioning

Partitioning by toYYYYMMDDhh(timestamp) creates a new partition every hour. If your workload inserts a few rows per hour, each insert spawns a distinct part and a distinct partition, multiplying the merge workload. A coarser granularity such as toYYYYMM(timestamp) or even daily partitions is usually sufficient for time‑series data.

3. Enable async inserts

Modern ClickHouse versions support async inserts (SET async_insert = 1). The server buffers incoming rows in memory and writes them out as larger parts, dramatically cutting the number of tiny parts generated by bursty streams. See the official docs for tuning parameters like async_insert_max_data_size.

4. Tune merge settings

  • max_bytes_to_merge_at_max_space_in_pool – limits how much data a single merge can touch, preventing a single huge merge from starving others.
  • number_of_free_entries_in_pool_to_lower_max_bytes_to_merge – reduces the merge size when the pool is near capacity, keeping merges responsive.
  • merge_max_sleep_ms – adds a small sleep between merge steps to avoid CPU contention with query workers.

Adjust these values only after measuring the current merge backlog with system.merges and system.parts tables.

5. Periodic manual merges (use with caution)

Running OPTIMIZE TABLE events FINAL forces a merge of all parts in a table. This can be useful after a known ingestion burst, but it does not solve the underlying pattern. Repeated manual optimizations can waste CPU and I/O, and they still leave the merge scheduler overwhelmed if inserts continue at the same rate.


Trade‑offs and pitfalls

Approach Benefit Cost / Risk
Larger batches Fewer parts, better compression, lower merge load Higher latency for individual events; requires buffering in the client or an intermediate queue
Coarser partitions Reduces part count per partition, easier merges May increase scan range for time‑based queries if partitions become too large
Async inserts Transparent buffering, minimal client changes Memory pressure on the ClickHouse server; mis‑configured limits can cause OOM
Aggressive merge tuning Keeps merge queue short, improves query latency May starve CPU from query execution; can increase I/O churn
Manual OPTIMIZE FINAL Immediate reduction of part count Expensive operation; does not prevent future fragmentation

The right mix depends on workload characteristics: bursty event streams benefit most from async inserts and batch aggregation, while steady high‑throughput pipelines may need a combination of larger batch sizes and tuned merge settings.


The FINAL clause is not a cure

SELECT * FROM events FINAL forces ClickHouse to merge parts on‑the‑fly during query execution. It does not delete the underlying tiny parts; it merely hides the fragmentation for that query. In a system already saturated with parts, the FINAL clause can make the query more expensive, because the engine must read and combine many fragments just to produce the result set.


Operational impact beyond queries

  • Startup time – On node restart, ClickHouse scans every part header to rebuild its in‑memory catalog. Tens of thousands of parts can add minutes to boot time.
  • Replication lag – Each part must be sent to replicas. A high part count inflates network traffic and replication queue size.
  • Backup/restore – Snapshot tools copy whole part files; many small files increase filesystem overhead and storage costs.

Bottom line

Too many tiny parts are a merge‑management problem, not a raw‑data‑size problem. By shaping ingestion to produce fewer, larger parts, keeping partition granularity sensible, and leveraging async inserts, you keep the merge subsystem in the driver’s seat. The result is lower latency, stable memory usage, and a cluster that scales predictably.

For further reading, see:


The insights above come from years of troubleshooting fragmented ClickHouse clusters in production. The patterns repeat across any large‑scale analytical store: when the storage engine spends more time managing its own metadata than serving queries, performance inevitably degrades.

Comments

Loading comments...