Add jitter to fetch poll loop to prevent producer stampeding #1150
brandur merged 3 commits into riverqueue:master
Conversation
Not sure: should this be configurable?
Full background disclosure (copying this from one of the other PRs that was merged): I was quite lazy overall in these PRs (mostly just vibed through one evening, with some questions from me to Opus and tests added), but I hope they can be of some use. Background: I had seen some odd and rather large discrete performance degradations over time at work (where we use River). Initially just for fun, I wanted to let Opus 4.6 have a look at the River source code to see if it could find any obvious places for improvement (and potential culprits for our performance degradations). Three of these four PRs are mostly just small things, but the jitter one could potentially be what we need to solve our problems at work. It's a bit difficult to reproduce, as we have a couple of pods running and starting at different times, and I think the stampeding herd only kicks in during DB connection resets/downtime (roughly once every 1-2 weeks judging by our CPU usage history so far; but it then stays in that high-CPU state, and only a redeploy of our pods resolves it). Perhaps it would be possible to write some tests that demonstrate the issue in a good way.
d := p.jitteredFetchPollInterval()
require.GreaterOrEqual(t, d, p.config.FetchPollInterval)
require.Less(t, d, p.config.FetchPollInterval+p.config.FetchCooldown)
}
This is a bit of silly Opus-generated code, but I'll let you decide whether you want to keep it or do something else with it. I usually tend to just accept Opus's choices in most cases, as long as it's isolated and doesn't risk any issues elsewhere.
Gotcha. Yeah, a little contrived but it's probably okay.
brandur left a comment
Nice! Damn, this must be one of the oldest TODOs in the code base, so nice to see it resolved. Left one comment here, but I think it's largely good to go.
producer_test.go (Outdated)
// Run enough iterations to catch any out-of-bounds values without being
// flaky. The jitter range is [FetchPollInterval, FetchPollInterval +
// FetchCooldown), so [1s, 1.1s).
for range 1_000 {
Maybe bring this down to 100 or something. Both numbers are a little arbitrary, but feels like we don't need to crunch away 1k times.
producer.go (Outdated)
// synchronizing their fetches after a transient event (e.g. GC pause, network
// blip), which would cause periodic DB load spikes.
func (p *producer) jitteredFetchPollInterval() time.Duration {
	return randutil.DurationBetween(p.config.FetchPollInterval, p.config.FetchPollInterval+p.config.FetchCooldown)
Is there any particular reason to use p.config.FetchCooldown as the number added to the poll interval here?
I'm just wondering if we should make it some other number, for example max(p.config.FetchPollInterval * 0.1, 100 * time.Millisecond) or something of that nature, so that it's more related to FetchPollInterval.
Honestly, I was pushing back on Opus's idea there; I should have pushed harder :).
Your suggestion here is better, I would say, though I would probably cap its lower end at 10ms instead of 100ms.
My other idea was to make it a separate config parameter entirely.
@brandur I pushed a new commit which I think adjusts it more towards your idea
Use 10% of FetchPollInterval (min 10ms) as the jitter range rather than coupling it to the unrelated FetchCooldown config value.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
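For illustration, here is a minimal sketch of the approach the commit message above describes: a jitter bound of 10% of FetchPollInterval with a 10ms floor, producing values in [interval, interval+bound). The actual patch computes this as a method on the producer and uses River's internal randutil.DurationBetween helper; this standalone version substitutes math/rand/v2 and a hypothetical function name.

```go
package main

import (
	"fmt"
	"math/rand/v2"
	"time"
)

// jitteredInterval returns a duration in [interval, interval+bound), where the
// bound is 10% of the interval with a 10ms floor, mirroring the commit message
// above. The function name and use of math/rand/v2 are ours, not the PR's.
func jitteredInterval(interval time.Duration) time.Duration {
	bound := max(interval/10, 10*time.Millisecond)
	return interval + rand.N(bound)
}

func main() {
	// With River's default FetchPollInterval of 1s, values fall in [1s, 1.1s).
	fmt.Println(jitteredInterval(time.Second))
}
```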
Summary

- Add [0, FetchCooldown) of jitter to each fetchPollLoop timer reset to prevent multiple producers from phase-locking their poll timers (producer.go)

The problem

When multiple River producers are connected to the same Postgres, their fetchPollLoop timers can phase-lock after a transient event, causing a permanent increase in DB load:

- Normally, each producer's poll timer fires every FetchPollInterval (default 1s). DB load is low and even.
- After a transient event (e.g. a GC pause or network blip), all fetchPollLoop timers fire at the same instant and Reset(FetchPollInterval) from that same moment. They are now synchronized.

Before the sync event: N producers spread across 1s → low, even load.
After: N producers hitting the DB at the same millisecond every 1s → periodic spikes that raise the average CPU baseline. This manifests as a sudden, permanent step-up in CPU usage that never recovers.
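To make the phase-locking concrete, here is a toy, self-contained simulation (not part of the PR): after a shared sync event, fixed-interval timers keep firing on the same millisecond forever, while jittered timers accumulate independent offsets and spread back out within a few cycles.

```go
package main

import (
	"fmt"
	"math/rand/v2"
	"time"
)

func main() {
	const producers = 5
	interval := time.Second
	jitter := 100 * time.Millisecond // a FetchCooldown-sized bound, as in the original patch

	// Both groups reset from the same instant (the "sync event").
	fixed := make([]time.Duration, producers)
	jittered := make([]time.Duration, producers)
	for cycle := 0; cycle < 10; cycle++ {
		for i := 0; i < producers; i++ {
			fixed[i] += interval                     // identical every cycle: still synchronized
			jittered[i] += interval + rand.N(jitter) // offsets diverge cycle by cycle
		}
	}
	fmt.Println("fixed next-fire offsets:   ", fixed)
	fmt.Println("jittered next-fire offsets:", jittered)
}
```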
The fix

Add [0, FetchCooldown) of random jitter to each poll timer reset (via the existing randutil.DurationBetween utility). With default config values, the poll interval becomes [1s, 1.1s) instead of a fixed 1s.

This means that even if timers accidentally synchronize, they naturally drift apart within a few cycles. The jitter is bounded by FetchCooldown (default 100ms), which is small relative to FetchPollInterval (default 1s), so it doesn't meaningfully affect job pickup latency.

This follows the same pattern already used by the leadership elector (elector.go:240) and the maintenance services.
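For reference, a minimal sketch of the timer-reset pattern the fix describes, not the actual producer code: every cycle re-arms the timer with a freshly jittered interval, so producers that happen to synchronize drift apart over subsequent cycles. pollLoop, fetch, and the inline jitter helper are illustrative names.

```go
package main

import (
	"context"
	"fmt"
	"math/rand/v2"
	"time"
)

// pollLoop is an illustrative stand-in for the producer's fetchPollLoop: it
// waits on a timer, runs the fetch step, and resets the timer to a jittered
// interval rather than a fixed one.
func pollLoop(ctx context.Context, interval time.Duration, fetch func()) {
	jittered := func() time.Duration { return interval + rand.N(interval/10) }

	timer := time.NewTimer(jittered())
	defer timer.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-timer.C:
			fetch()
			timer.Reset(jittered())
		}
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	pollLoop(ctx, time.Second, func() { fmt.Println("fetch") })
}
```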
elector.go:240) and maintenance services.Test plan
TestProducer_jitteredFetchPollIntervalruns 1,000 iterations verifying all values stay within[FetchPollInterval, FetchPollInterval+FetchCooldown)TestProducertests pass🤖 Generated with Claude Code