How Cloudflare Fixed a ClickHouse Query‑Planning Bottleneck
#Infrastructure

How Cloudflare Fixed a ClickHouse Query‑Planning Bottleneck

Serverless Reporter
5 min read

Cloudflare traced a billing‑pipeline slowdown to lock contention in ClickHouse’s query planner, patched the engine to use a shared lock, removed per‑query part copies and streamlined part filtering, cutting query latency by half and influencing future data‑partitioning designs.

Cloudflare’s ClickHouse Query‑Planning Bottleneck

Featured image

Cloudflare’s billing and fraud pipelines rely on ClickHouse to aggregate petabytes of log data. After a migration that introduced per‑tenant partitioning, daily aggregation jobs began to run noticeably slower, even though I/O, memory usage and rows scanned looked normal. The team opened a deep dive into the problem and discovered that the slowdown originated in the query‑planning phase – the step where ClickHouse decides which data parts to read before actual execution.


What Happened?

  • Data growth – By the end of 2024 the Ready‑Analytics table held more than 2 PiB and ingested millions of rows per second.
  • Partition redesign – To support per‑customer retention, Cloudflare added a namespace column to the partition key. The number of parts per table rose sharply.
  • Observed symptom – SELECT queries that previously took a few seconds started taking double that time. Traditional metrics (disk latency, CPU utilisation) stayed within expected ranges.

The engineers profiled the workload with perf and found that 45 % of sampled CPU time was spent inside a single function called filterPartsByPartition. The function was protected by an exclusive mutex (MergeTreeData) that guards the table’s global parts list. With thousands of concurrent queries, each one tried to acquire the same lock, leading to massive contention.


The Fixes Cloudflare Applied

  1. Replace the exclusive lock with a shared lock
    • The original implementation required a writer‑only lock even for read‑only planning work. By switching to a read‑write lock, multiple planners can run in parallel while still preventing concurrent modifications.
  2. Drop the per‑query copy of the parts list
    • Previously ClickHouse duplicated the full list of parts for every query, consuming memory and adding copy overhead. The patch now references the shared list directly, eliminating the copy step.
  3. Improve part‑filtering logic
    • The old filter iterated over the entire parts list regardless of the partition predicate. The new algorithm narrows the scan to the relevant partition range before applying additional filters, reducing the number of comparisons dramatically.

All three changes were contributed back to the ClickHouse upstream project and landed in version 25.11. Cloudflare deployed the patched binaries in March 2026.


Impact on Query Latency

After the rollout, the average SELECT duration fell by roughly 50 %. More importantly, the correlation between query time and the number of parts disappeared – the planner no longer became a bottleneck as the parts count continued to climb.

Metric (pre‑patch) Metric (post‑patch)
Avg SELECT time 2.4 s → 1.2 s
Mutex wait % 55 % → 12 %
Memory per query 12 MiB → 4 MiB

These numbers are taken from Cloudflare’s internal monitoring dashboards, which the team has made publicly visible in their blog post.


Architectural Takeaways

1. Lock granularity matters even in read‑heavy analytics workloads

ClickHouse is built around a MergeTree engine that assumes most operations are append‑only. When a workload introduces frequent reads that need to inspect metadata, a coarse‑grained exclusive lock can become a hidden choke point. Using a read‑write lock or lock‑free data structures for metadata can keep the planner scalable.

2. Avoid per‑query duplication of large metadata structures

Copying the full parts list for every query adds both CPU and memory pressure. Reference‑counted or immutable views of the metadata allow many queries to share the same structure safely.

3. Partition design should consider metadata explosion

Adding a tenant identifier to the partition key increased the number of parts dramatically. While it simplified retention per tenant, it also inflated the size of the parts catalog. A hybrid approach—keeping a coarse global partition and a secondary index for tenant‑level pruning—can limit metadata growth.

4. Monitoring must go beyond high‑level telemetry

Cloudflare’s engineers observed normal I/O and CPU usage, yet the real issue lay in lock contention. Tools that surface wait‑time for mutexes, such as perf lock or ClickHouse’s own system.merges table, are essential for diagnosing similar problems.


What This Means for Other ClickHouse Users

If you are running ClickHouse at scale and have recently introduced new partition keys, consider the following checklist:

  • Profile the planner – Run EXPLAIN PIPELINE on representative queries and look for time spent in filterPartsByPartition or similar functions.
  • Audit lock usage – Verify whether the engine is using exclusive locks for read‑only planning phases.
  • Check parts count – Monitor the growth of the parts catalog (system.parts) and its impact on ZooKeeper or the ClickHouse Keeper service.
  • Upgrade – Versions 25.11 and later include Cloudflare’s patches. Upgrading can bring immediate benefits without custom patches.

Open Questions and Future Work

  • Metadata service scaling – As the parts list continues to expand, the coordination layer (ZooKeeper or ClickHouse Keeper) may become a new bottleneck. Cloudflare’s team is experimenting with a lightweight metadata cache that lives outside the core engine.
  • Dynamic partition pruning – Further work could push pruning logic deeper into the storage layer, allowing the engine to skip loading irrelevant parts entirely.
  • Observability extensions – Adding per‑query lock‑wait histograms to ClickHouse’s system.metrics table would give operators a built‑in view of contention.

Conclusion

Cloudflare’s investigation turned a seemingly opaque billing slowdown into a concrete set of engineering improvements. By exposing lock contention in the query planner, patching the engine to use shared locks, and trimming unnecessary metadata copies, they restored performance and contributed valuable fixes back to the open‑source community. The episode underscores that, at massive scale, coordination overhead can eclipse raw compute resources, and that fine‑grained observability remains a critical tool for cloud architects.


Renato Losio is a principal cloud architect and AWS Data Hero. Follow him on LinkedIn.

Comments

Loading comments...