tune GUC defaults for remote ClickHouse targets #39
Open
iskakaushik wants to merge 2 commits into main
Conversation
Previous defaults were tuned for local ClickHouse (sub-millisecond INSERT latency). With remote targets like ClickHouse Cloud (~80ms RTT), the old defaults caused 53% event loss under moderate load (37K TPS pgbench).

Changes:
- queue_capacity: 65536 -> 131072 (shmem: ~600MB)
- batch_max: 10000 -> 100000 (amortizes per-INSERT network overhead)
- flush_interval_ms: 1000 -> 200 (faster drain wakeup for bursty workloads)

Benchmarked against ClickHouse Cloud (us-west-2, ~80ms RTT):
- batch_max=10K -> 53% event loss, 20K events/s drain
- batch_max=100K -> 24% event loss, 80K events/s drain
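For readers unfamiliar with the knobs being changed, here is a minimal sketch of how defaults like these are typically registered in a PostgreSQL extension's `_PG_init()` through the standard `DefineCustomIntVariable` GUC API. The variable names, bounds, and GUC contexts below are assumptions for illustration; pg_stat_ch's actual registration code is not part of this diff.

```c
#include "postgres.h"

#include <limits.h>

#include "fmgr.h"
#include "utils/guc.h"

PG_MODULE_MAGIC;

/* Hypothetical backing variables; the real names in pg_stat_ch may differ. */
static int pgsc_queue_capacity    = 131072; /* was 65536 */
static int pgsc_batch_max         = 100000; /* was 10000 */
static int pgsc_flush_interval_ms = 200;    /* was 1000 */

void _PG_init(void);

void
_PG_init(void)
{
	/* Queue capacity determines the shared-memory request, so it is
	 * assumed here to be changeable only at server start. */
	DefineCustomIntVariable("pg_stat_ch.queue_capacity",
	                        "Slots in the shared-memory event queue.",
	                        NULL,
	                        &pgsc_queue_capacity,
	                        131072, 1024, INT_MAX,
	                        PGC_POSTMASTER, 0,
	                        NULL, NULL, NULL);

	/* A larger batch amortizes the ~80ms network round trip per INSERT. */
	DefineCustomIntVariable("pg_stat_ch.batch_max",
	                        "Maximum events per batched INSERT.",
	                        NULL,
	                        &pgsc_batch_max,
	                        100000, 1, INT_MAX,
	                        PGC_SIGHUP, 0,
	                        NULL, NULL, NULL);

	/* A shorter interval wakes the drain worker sooner after a burst. */
	DefineCustomIntVariable("pg_stat_ch.flush_interval_ms",
	                        "Drain worker wakeup interval (ms).",
	                        NULL,
	                        &pgsc_flush_interval_ms,
	                        200, 1, 60000,
	                        PGC_SIGHUP, 0,
	                        NULL, NULL, NULL);
}
```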
Force-pushed from 6b7bab5 to 5b7ea92.
Pull request overview
This PR tunes the GUC (Grand Unified Configuration) defaults for pg_stat_ch to optimize performance for remote ClickHouse targets (e.g., ClickHouse Cloud), which have significantly higher network latency (~80ms RTT) than local ClickHouse instances (sub-ms latency). The changes increase queue capacity and batch size while reducing the flush interval, improving throughput and reducing event loss under high load.
Changes:
- Increased `queue_capacity` from 65,536 to 131,072 (2x increase, ~600MB shmem)
- Increased `batch_max` from 10,000 to 100,000 (10x increase for a higher drain rate)
- Reduced `flush_interval_ms` from 1,000ms to 200ms (5x faster wake-up for bursty workloads)
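A side note on the shared-memory figure: dividing ~600MB by 131,072 slots implies roughly 4.8KB per queued event. The slot size is inferred here, not taken from the diff; this standalone check is purely illustrative.

```c
#include <stdio.h>

int
main(void)
{
    const long capacity    = 131072;      /* new queue_capacity default */
    const long shmem_bytes = 600L << 20;  /* ~600 MB, per the PR text */

    /* Implied footprint: ~4.8 KB per queued event slot. */
    printf("bytes per slot: ~%ld\n", shmem_bytes / capacity); /* 4800 */
    return 0;
}
```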
Summary
- `queue_capacity`: 65,536 → 131,072 (~600MB shmem, buffers ~0.5s at 260K events/sec; see the arithmetic sketch below)
- `batch_max`: 10,000 → 100,000 (amortizes per-INSERT network overhead; 4x measured drain rate to remote ClickHouse)
- `flush_interval_ms`: 1,000 → 200 (faster wakeup for bursty workloads)

Previous defaults were tuned for local ClickHouse (sub-ms INSERT latency). With ClickHouse Cloud (~80ms RTT), the old defaults caused 53% event loss at 37K TPS. The new defaults reduce loss to ~9% for the same workload. That is still not zero (eliminating loss entirely would require architectural changes like disk buffering), but it is a significant improvement with no code changes needed from users.
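One possible reading of the 260K events/sec figure, not stated in the PR: the default pgbench TPC-B-like script issues about seven statements per transaction, and 37K TPS × 7 ≈ 260K. The buffer-window claim then follows directly, as this illustrative check shows:

```c
#include <stdio.h>

int
main(void)
{
    const double tps            = 37000.0;   /* pgbench TPS from the PR */
    const double stmts_per_tx   = 7.0;       /* assumed: default pgbench script */
    const double queue_capacity = 131072.0;  /* new default */

    double events_per_sec = tps * stmts_per_tx;            /* ~259K */
    double buffer_window  = queue_capacity / events_per_sec;

    /* ~0.51 s: matches the "buffers ~0.5s at 260K events/sec" bullet. */
    printf("%.0f events/s, %.2f s buffer window\n",
           events_per_sec, buffer_window);
    return 0;
}
```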
Benchmark results (32 clients, 30s pgbench, CH Cloud us-west-2):

| Configuration | Event loss | Drain rate |
| --- | --- | --- |
| Old default (`batch_max`=10K) | 53% | 20K events/s |
| `batch_max`=100K | 24% | 80K events/s |
| All new defaults combined | ~9% | n/a |
Users targeting remote ClickHouse with high-throughput workloads should further increase `batch_max` and `queue_capacity` based on their load profile; a rough sizing rule is sketched below.
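Read as a rule of thumb (inferred from the numbers above, not an API offered by pg_stat_ch): size the queue to absorb full-rate ingest for the duration of one worst-case drain cycle. A hypothetical helper:

```c
#include <stdio.h>

/* Hypothetical sizing helper, not part of pg_stat_ch. */
static long
suggest_queue_capacity(long events_per_sec, double drain_cycle_secs)
{
    return (long) ((double) events_per_sec * drain_cycle_secs);
}

int
main(void)
{
    /* 260K ev/s with a 0.5 s worst-case drain cycle lands near the
     * new 131072 default; capacity scales linearly with sustained load. */
    printf("%ld\n", suggest_queue_capacity(260000, 0.5)); /* 130000 */
    return 0;
}
```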
Test plan

- `mise run build`

🤖 Generated with Claude Code