Add jitter to fetch poll loop to prevent producer stampeding #1150
brandur merged 3 commits into riverqueue:master
Conversation
Not sure: should this be configurable?
Full background disclosure (copying this from one of the other PRs that was merged): I was quite lazy overall in these PRs (mostly just vibed through one evening, with some questions from me to Opus and tests added), but I hope they can be of some use. Background: I had seen some odd and rather large discrete performance degradations over time at work (where we use River). Initially just for fun, I wanted to let Opus 4.6 have a look at the River source code to see if it could find any obvious places for improvement (and potential culprits for our performance degradations). Three of these four PRs are mostly just small things, but the jitter one could potentially be what we need to solve our problems at work. It's a bit difficult to reproduce, as we have a couple of pods running and starting at different times, and I think the stampeding herd only kicks in during DB connection resets/downtime (roughly once every 1-2 weeks judging by our CPU usage history so far; but it then stays in that high-CPU state, and only a redeploy of our pods resolves it). Perhaps it would be possible to write some tests that demonstrate the issue in a good way.
d := p.jitteredFetchPollInterval()
require.GreaterOrEqual(t, d, p.config.FetchPollInterval)
require.Less(t, d, p.config.FetchPollInterval+p.config.FetchCooldown)
}
This is a bit of silly Opus-generated code, but I'll let you decide whether you want to keep it or do something else with it. I usually tend to just accept Opus's choices in most cases, as long as it's isolated and doesn't risk any issues elsewhere.
Gotcha. Yeah, a little contrived but it's probably okay.
brandur left a comment
Nice! Damn, this must be one of the oldest TODOs in the code base, so nice to see it resolved. Left one comment here, but I think it's largely good to go.
producer_test.go (Outdated)
// Run enough iterations to catch any out-of-bounds values without being
// flaky. The jitter range is [FetchPollInterval, FetchPollInterval +
// FetchCooldown), so [1s, 1.1s).
for range 1_000 {
Maybe bring this down to 100 or something. Both numbers are a little arbitrary, but feels like we don't need to crunch away 1k times.
producer.go (Outdated)
// synchronizing their fetches after a transient event (e.g. GC pause, network
// blip), which would cause periodic DB load spikes.
func (p *producer) jitteredFetchPollInterval() time.Duration {
	return randutil.DurationBetween(p.config.FetchPollInterval, p.config.FetchPollInterval+p.config.FetchCooldown)
Is there any particular reason to use p.config.FetchCooldown as the number added to the poll interval here?
I'm just wondering if we should make it some other number, for example max(p.config.FetchPollInterval * 0.1, 100 * time.Millisecond) or something of that nature, so that it's more related to FetchPollInterval.
Honestly, I was pushing back on Opus's idea there; I should have pushed harder :).
Your suggestion here is better, I would say, though I would probably cap its lower end at 10ms instead of 100ms.
My other idea was to make it a separate config parameter entirely.
@brandur I pushed a new commit which I think adjusts it more towards your idea
Use 10% of FetchPollInterval (min 10ms) as the jitter range rather than coupling it to the unrelated FetchCooldown config value.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
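For illustration, here is a minimal sketch of the approach the commit message above describes: a jitter bound of 10% of FetchPollInterval with a 10ms floor, producing values in [interval, interval+bound). The actual patch computes this as a method on the producer and uses River's internal randutil.DurationBetween helper; this standalone version substitutes math/rand/v2 and a hypothetical function name.

```go
package main

import (
	"fmt"
	"math/rand/v2"
	"time"
)

// jitteredInterval returns a duration in [interval, interval+bound), where the
// bound is 10% of the interval with a 10ms floor, mirroring the commit message
// above. The function name and use of math/rand/v2 are ours, not the PR's.
func jitteredInterval(interval time.Duration) time.Duration {
	bound := max(interval/10, 10*time.Millisecond)
	return interval + rand.N(bound)
}

func main() {
	// With River's default FetchPollInterval of 1s, values fall in [1s, 1.1s).
	fmt.Println(jitteredInterval(time.Second))
}
```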
Summary

- Add [0, FetchCooldown) of jitter to each fetchPollLoop timer reset to prevent multiple producers from phase-locking their poll timers (producer.go)

The problem

When multiple River producers are connected to the same Postgres, their fetchPollLoop timers can phase-lock after a transient event, causing a permanent increase in DB load:

- Normally, each producer's poll timer fires every FetchPollInterval (default 1s). DB load is low and even.
- After a transient event (e.g. a GC pause or network blip), all fetchPollLoop timers fire at the same instant and Reset(FetchPollInterval) from that same moment. They are now synchronized.

Before the sync event: N producers spread across 1s → low, even load.
After: N producers hitting the DB at the same millisecond every 1s → periodic spikes that raise the average CPU baseline. This manifests as a sudden, permanent step-up in CPU usage that never recovers.
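To make the phase-locking concrete, here is a toy, self-contained simulation (not part of the PR): after a shared sync event, fixed-interval timers keep firing on the same millisecond forever, while jittered timers accumulate independent offsets and spread back out within a few cycles.

```go
package main

import (
	"fmt"
	"math/rand/v2"
	"time"
)

func main() {
	const producers = 5
	interval := time.Second
	jitter := 100 * time.Millisecond // a FetchCooldown-sized bound, as in the original patch

	// Both groups reset from the same instant (the "sync event").
	fixed := make([]time.Duration, producers)
	jittered := make([]time.Duration, producers)
	for cycle := 0; cycle < 10; cycle++ {
		for i := 0; i < producers; i++ {
			fixed[i] += interval                     // identical every cycle: still synchronized
			jittered[i] += interval + rand.N(jitter) // offsets diverge cycle by cycle
		}
	}
	fmt.Println("fixed next-fire offsets:   ", fixed)
	fmt.Println("jittered next-fire offsets:", jittered)
}
```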
The fix

Add [0, FetchCooldown) of random jitter to each poll timer reset (via the existing randutil.DurationBetween utility). With default config values, the poll interval becomes [1s, 1.1s) instead of a fixed 1s.

This means that even if timers accidentally synchronize, they naturally drift apart within a few cycles. The jitter is bounded by FetchCooldown (default 100ms), which is small relative to FetchPollInterval (default 1s), so it doesn't meaningfully affect job pickup latency.

This follows the same pattern already used by the leadership elector (elector.go:240) and the maintenance services.
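For reference, a minimal sketch of the timer-reset pattern the fix describes, not the actual producer code: every cycle re-arms the timer with a freshly jittered interval, so producers that happen to synchronize drift apart over subsequent cycles. pollLoop, fetch, and the inline jitter helper are illustrative names.

```go
package main

import (
	"context"
	"fmt"
	"math/rand/v2"
	"time"
)

// pollLoop is an illustrative stand-in for the producer's fetchPollLoop: it
// waits on a timer, runs the fetch step, and resets the timer to a jittered
// interval rather than a fixed one.
func pollLoop(ctx context.Context, interval time.Duration, fetch func()) {
	jittered := func() time.Duration { return interval + rand.N(interval/10) }

	timer := time.NewTimer(jittered())
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			fetch()
			timer.Reset(jittered())
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	pollLoop(ctx, time.Second, func() { fmt.Println("fetch") })
}
```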
elector.go:240) and maintenance services.Test plan
TestProducer_jitteredFetchPollIntervalruns 1,000 iterations verifying all values stay within[FetchPollInterval, FetchPollInterval+FetchCooldown)TestProducertests pass🤖 Generated with Claude Code