tune GUC defaults for remote ClickHouse targets #39

Open

iskakaushik wants to merge 2 commits into main from tune-guc-defaults

Conversation

iskakaushik (Collaborator) commented Feb 14, 2026

Summary

  • queue_capacity: 65,536 → 131,072 (~600MB shmem, buffers ~0.5s at 260K events/sec)
  • batch_max: 10,000 → 200,000 (6x drain rate to remote ClickHouse)
  • flush_interval_ms: 1,000 → 200 (faster wakeup for bursty workloads)

Previous defaults were tuned for local ClickHouse (sub-ms INSERT latency). With ClickHouse Cloud (~80ms RTT), the old defaults caused 53% event loss at 37K TPS. The new defaults reduce loss to ~9% for the same workload. That is still not zero (eliminating loss entirely requires architectural changes such as disk buffering), but it is a significant improvement that needs no configuration changes from users.
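As a sanity check on those figures (illustrative arithmetic only, using the numbers quoted above and in the benchmark table below):

```sql
-- Burst absorption: how long the enlarged queue holds events before dropping any
SELECT round(131072 / 260000.0, 2) AS queue_buffer_seconds;  -- ~0.50 s at 260K events/s

-- Drain speedup from the old default (10K batches) to the new one (200K batches)
SELECT round(125000 / 20000.0, 1) AS drain_speedup;          -- ~6.3x, the "6x" above
```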

Benchmark results (32 clients, 30s pgbench, ClickHouse Cloud us-west-2):

| batch_max | Event loss | Drain rate |
| --- | --- | --- |
| 10,000 (old default) | 53.2% | ~20K events/s |
| 100,000 | 23.6% | ~80K events/s |
| 200,000 (new default) | 8.7% | ~125K events/s |
| 500,000 | 2.5% | ~134K events/s |

Users targeting remote ClickHouse with high-throughput workloads should further increase batch_max and queue_capacity based on their load profile.
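For example, a deployment aiming for the 500K row in the table above might set something like the following (a hypothetical sketch: the `pg_stat_ch.` GUC prefix is assumed from the extension name, and the doubled queue_capacity is only an illustration):

```sql
-- Hypothetical overrides for a high-throughput remote ClickHouse target.
-- The "pg_stat_ch." prefix is an assumption; verify names against the extension docs.
ALTER SYSTEM SET pg_stat_ch.batch_max = 500000;
ALTER SYSTEM SET pg_stat_ch.queue_capacity = 262144;  -- 2x the new default (illustrative)

-- batch_max can likely be picked up on reload, but queue_capacity sizes
-- shared memory and should be expected to require a full server restart.
SELECT pg_reload_conf();
```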

Test plan

  • Builds clean (`mise run build`)
  • Regression tests pass
  • Verified with local ClickHouse Docker (no behavior change expected)

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings February 14, 2026 23:15

Commit message:

    Previous defaults were tuned for local ClickHouse (sub-millisecond INSERT
    latency). With remote targets like ClickHouse Cloud (~80ms RTT), the old
    defaults caused 53% event loss under moderate load (37K TPS pgbench).

    Changes:
    - queue_capacity: 65536 -> 131072 (shmem: ~600MB)
    - batch_max: 10000 -> 100000 (amortizes per-INSERT network overhead)
    - flush_interval_ms: 1000 -> 200 (faster drain wakeup for bursty workloads)

    Benchmarked against ClickHouse Cloud (us-west-2, ~80ms RTT):
      batch_max=10K  -> 53% event loss, 20K events/s drain
      batch_max=100K -> 24% event loss, 80K events/s drain

Copilot AI left a comment


Pull request overview

This PR tunes the GUC (Grand Unified Configuration) defaults for pg_stat_ch to optimize performance for remote ClickHouse targets (e.g., ClickHouse Cloud), which have significantly higher network latency (~80ms RTT) than local ClickHouse instances (sub-ms latency). The changes increase queue capacity and batch size while reducing the flush interval, improving throughput and reducing event loss under high load.

Changes:

  • Increased queue_capacity from 65,536 to 131,072 (2x increase, ~600MB shmem)
  • Increased batch_max from 10,000 to 100,000 (10x increase for higher drain rate)
  • Reduced flush_interval_ms from 1,000ms to 200ms (5x faster wake-up for bursty workloads)
