
Add jitter to fetch poll loop to prevent producer stampeding#1150

Merged
brandur merged 3 commits into riverqueue:master from GiGurra:fix/fetch-poll-jitter on Feb 15, 2026

Conversation

@GiGurra (Contributor) commented on Feb 13, 2026

Summary

  • Add random jitter of [0, FetchCooldown) to each fetchPollLoop timer reset to prevent multiple producers from phase-locking their poll timers
  • Remove the resolved TODO comment from producer.go
  • Add test verifying jitter stays within expected bounds

The problem

When multiple River producers are connected to the same Postgres, their fetchPollLoop timers can phase-lock after a transient event, causing a permanent increase in DB load:

  1. Producers start at different times, so their poll timers are naturally spread out across the FetchPollInterval (default 1s). DB load is low and even.
  2. A transient event occurs — a GC pause, a network blip, Postgres connection pool saturation, or a burst of LISTEN/NOTIFY — causing all producer goroutines to stall briefly.
  3. When they resume, all accumulated fetchPollLoop timers fire at the same instant and Reset(FetchPollInterval) from that same moment. They are now synchronized.
  4. Since the interval is fixed with no jitter, they stay synchronized permanently — all N producers hit the database at the same millisecond, every second, forever.

Before the sync event: N producers spread across 1s → low, even load.
After: N producers hitting the DB at the same millisecond every 1s → periodic spikes that raise the average CPU baseline. This manifests as a sudden, permanent step-up in CPU usage that never recovers.
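
To make this concrete, here is a small standalone Go sketch (illustrative only, not part of this PR; all names and numbers are assumptions) that models N phase-locked poll timers and shows how a per-cycle jitter term spreads their firing times back out:

package main

import (
	"fmt"
	"math/rand"
	"slices"
	"time"
)

// Models N producers whose poll timers were just synchronized by a transient
// stall, then shows how a random jitter term lets their firing times drift
// apart again over a few cycles.
func main() {
	const (
		n            = 5
		pollInterval = time.Second
		jitterMax    = 100 * time.Millisecond // roughly River's default FetchCooldown
		cycles       = 10
	)

	// All timers start phase-locked at t=0, as after a transient stall.
	next := make([]time.Duration, n)

	for c := 1; c <= cycles; c++ {
		for i := range next {
			// A fixed interval would keep the timers locked forever; the
			// random [0, jitterMax) term spreads them out within a few cycles.
			next[i] += pollInterval + time.Duration(rand.Int63n(int64(jitterMax)))
		}
		fmt.Printf("cycle %2d: spread between earliest and latest poll = %v\n",
			c, slices.Max(next)-slices.Min(next))
	}
}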

The fix

Add [0, FetchCooldown) of random jitter to each poll timer reset (via the existing randutil.DurationBetween utility). With default config values, the poll interval becomes [1s, 1.1s) instead of a fixed 1s.

This means even if timers accidentally synchronize, they naturally drift apart within a few cycles. The jitter is bounded by FetchCooldown (default 100ms), which is small relative to FetchPollInterval (default 1s), so it doesn't meaningfully affect job pickup latency.

This follows the same pattern already used by the leadership elector (elector.go:240) and maintenance services.
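
For readers who haven't opened the diff, a minimal sketch of the idea follows. The jitteredFetchPollInterval helper matches the hunk quoted further down in the review; the surrounding poll loop, its variable names, and the fetch callback are simplified assumptions, not River's actual producer code:

// Sketch of a timer-driven poll loop using the jittered interval. The loop
// shape is illustrative; only jitteredFetchPollInterval and
// randutil.DurationBetween come from the PR itself.
func (p *producer) fetchPollLoopSketch(ctx context.Context, fetch func()) {
	timer := time.NewTimer(p.jitteredFetchPollInterval())
	defer timer.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			fetch()
			// Resetting with a freshly jittered duration each cycle is what
			// lets accidentally synchronized producers drift apart again.
			timer.Reset(p.jitteredFetchPollInterval())
		}
	}
}

// As originally proposed in this PR: jitter bounded by FetchCooldown, so with
// default config the effective poll interval falls in [1s, 1.1s).
func (p *producer) jitteredFetchPollInterval() time.Duration {
	return randutil.DurationBetween(p.config.FetchPollInterval, p.config.FetchPollInterval+p.config.FetchCooldown)
}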

Test plan

  • New TestProducer_jitteredFetchPollInterval runs 1,000 iterations verifying all values stay within [FetchPollInterval, FetchPollInterval+FetchCooldown)
  • All existing TestProducer tests pass
  • No DB required for the new unit test
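
Pieced together from the fragments quoted in the review below, the new test looks roughly like this; the producer/config setup is an assumption (only the loop and assertions appear in this thread):

func TestProducer_jitteredFetchPollInterval(t *testing.T) {
	t.Parallel()

	// Assumed setup: a producer with default-ish config values; the real
	// test's construction isn't shown in this conversation.
	p := &producer{config: &producerConfig{
		FetchCooldown:     100 * time.Millisecond,
		FetchPollInterval: time.Second,
	}}

	// Run enough iterations to catch any out-of-bounds values without being
	// flaky. The jitter range is [FetchPollInterval, FetchPollInterval +
	// FetchCooldown), so [1s, 1.1s).
	for range 1_000 {
		d := p.jitteredFetchPollInterval()
		require.GreaterOrEqual(t, d, p.config.FetchPollInterval)
		require.Less(t, d, p.config.FetchPollInterval+p.config.FetchCooldown)
	}
}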

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@GiGurra (Contributor, Author) commented on Feb 13, 2026

Not sure: should this be configurable?
I can add config params for it if you want it

@GiGurra (Contributor, Author) commented on Feb 15, 2026

Full background disclosure (copying this from one of the other PRs that was merged): I was quite lazy overall in these PRs (mostly just vibed during one evening, with some questions from me to Opus and tests added), but I hope they can be of some use.

Background: I had seen some odd and rather large discrete performance degradations over time at work (where we use River). Initially just for fun, I wanted to let Opus 4.6 have a look at the River source code to see if it could find any obvious places for improvement (and potential culprits for our performance degradations). 3 of these 4 PRs are mostly just small things, but the jitter one could potentially be what we need to solve our problems at work. It's a bit difficult to reproduce, as we have a couple of pods running and starting at different times, and only during DB connection resets/downtime would the stampeding herd kick in, I think (roughly once every 1-2 weeks judging by our CPU use history so far; it then stayed in that high-CPU state, and only a redeploy of our pods would fix it).

Perhaps it would be possible to write some tests that demonstrate the issue in a good way

producer_test.go

d := p.jitteredFetchPollInterval()
require.GreaterOrEqual(t, d, p.config.FetchPollInterval)
require.Less(t, d, p.config.FetchPollInterval+p.config.FetchCooldown)
}
GiGurra (Contributor, Author) commented:

This is a bit silly Opus-generated code, but I'll let you decide if you want to keep it or do something else with it. I usually tend to just accept Opus' choices in most cases, as long as it's isolated and doesn't risk any issues elsewhere.

brandur (Contributor) replied:

Gotcha. Yeah, a little contrived but it's probably okay.

brandur (Contributor) left a review:

Nice! Damn, this must be one of the oldest TODOs in the code base, so nice to see it resolved. Left one comment here, but I think it's largely good to go.

producer_test.go Outdated
// Run enough iterations to catch any out-of-bounds values without being
// flaky. The jitter range is [FetchPollInterval, FetchPollInterval +
// FetchCooldown), so [1s, 1.1s).
for range 1_000 {
brandur (Contributor) commented:

Maybe bring this down to 100 or something. Both numbers are a little arbitrary, but feels like we don't need to crunch away 1k times.


producer.go Outdated
// synchronizing their fetches after a transient event (e.g. GC pause, network
// blip), which would cause periodic DB load spikes.
func (p *producer) jitteredFetchPollInterval() time.Duration {
return randutil.DurationBetween(p.config.FetchPollInterval, p.config.FetchPollInterval+p.config.FetchCooldown)
brandur (Contributor) commented:

Is there any particular reason to use p.config.FetchCooldown in particular as the added number to the poll interval here?

I'm just wondering if we should make it some other number. Like for example, max(p.config.FetchPollInterval * 0.1, 100 * time.Millisecond) or something of that nature so that it's more related to FetchPollInterval.

GiGurra (Contributor, Author) replied on Feb 15, 2026:

Honestly, I was pushing back on Opus' idea on that; I should have pushed more :).

Your suggestion here is better, I would say, though I would probably cap its lower end at 10ms instead of 100ms.

My other idea was to make it a separate config parameter entirely.

GiGurra (Contributor, Author) replied:

@brandur I pushed a new commit which I think adjusts it more towards your idea

GiGurra and others added 2 commits February 15, 2026 20:15
Use 10% of FetchPollInterval (min 10ms) as the jitter range rather
than coupling it to the unrelated FetchCooldown config value.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
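
The final version of the helper isn't quoted in this thread; based on the commit message above, it presumably looks something like the following sketch (naming assumed; only the bounds come from the commit message):

// Jitter bounded by 10% of FetchPollInterval, with a 10ms floor, per the
// follow-up commit.
func (p *producer) jitteredFetchPollInterval() time.Duration {
	jitter := max(p.config.FetchPollInterval/10, 10*time.Millisecond)
	return randutil.DurationBetween(p.config.FetchPollInterval, p.config.FetchPollInterval+jitter)
}
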
brandur (Contributor) left a review:

Works for me. Thanks!

@brandur brandur merged commit e85c4ba into riverqueue:master Feb 15, 2026
11 checks passed
brandur added a commit that referenced this pull request Feb 15, 2026
Adds changelog entries for #1147, #1148, #1149, and #1150.

Also make a couple small tweaks: one to alphabetize a newly added
property, and one to move a success case test to above the error case,
which is a little more conventional.
brandur added a commit that referenced this pull request Feb 15, 2026
#1151)

Adds changelog entries for #1147, #1148, #1149, and #1150.

Also make a couple small tweaks: one to alphabetize a newly added
property, and one to move a success case test to above the error case,
which is a little more conventional.