Add timeout to standard pilot fetch by peter941221 · Pull Request #1255 · riverqueue/river

peter941221 · 2026-05-25T13:08:20Z

Summary

Fix StandardPilot.JobGetAvailable so a stalled fetch does not hang a producer indefinitely.

Problem

producer.dispatchWork intentionally strips cancellation from the work context before fetching jobs so an in-flight fetch is allowed to complete during shutdown:

producer.go:744-766

That is reasonable, but StandardPilot.JobGetAvailable forwarded directly to exec.JobGetAvailable with no timeout at all:

rivershared/riverpilot/standard_pilot.go:18-22

This meant a stalled driver call could block a standard-pilot producer forever. The pro pilot already applies per-attempt fetch timeouts, so the standard pilot was the outlier.

Change

Add a 10-second timeout inside StandardPilot.JobGetAvailable before calling the driver.

This keeps the existing shutdown semantics intact:

- fetches still ignore parent cancellation from dispatchWork
- but they are now bounded, so a wedged DB call eventually returns instead of freezing the producer forever

The timeout is local to the standard pilot so there is no driver SQL change and no producer state-machine change.

Testing

- added rivershared/riverpilot/standard_pilot_test.go
- covered MaxToLock <= 0 no-op behavior
- covered a hung JobGetAvailable call timing out with context.DeadlineExceeded
- covered parent cancellation still winning when the incoming context is already canceled

Verification

Locally verified with:

GOPROXY=https://goproxy.cn,direct GOSUMDB=off go test ./rivershared/riverpilot -count=1

Closes #1026.

[skip ci]

brandur

Thanks!

@bgentry Any strong opinions on how you want to handle this one? Another option is to just put the timeout in producer.go in dispatchWork:

func (p *producer) dispatchWork(workCtx context.Context, count int, fetchResultCh chan<- producerFetchResult) {
	// This intentionally removes any deadlines or cancellation from the parent
	// context because we don't want it to get cancelled if the producer is asked
	// to shut down. In that situation, we want to finish fetching any jobs we are
	// in the midst of fetching, work them, and then stop. Otherwise we'd have a
	// risk of shutting down when we had already fetched jobs in the database,
	// leaving those jobs stranded. We'd then potentially have to release them
	// back to the queue.
	ctx := context.WithoutCancel(workCtx)

	// Maximum size of the `attempted_by` array on each job row. This maximum is
	// rarely hit, but exists to protect against degenerate cases.
	const maxAttemptedBy = 100

	jobs, err := p.pilot.JobGetAvailable(ctx, p.exec, p.state, &riverdriver.JobGetAvailableParams{
		ClientID:       p.config.ClientID,
		MaxAttemptedBy: maxAttemptedBy,
		MaxToLock:      count,
		Now:            p.Time.NowOrNil(),
		Queue:          p.config.Queue,
		ProducerID:     p.id.Load(),
		Schema:         p.config.Schema,
	})
	if err != nil {
		fetchResultCh <- producerFetchResult{err: err}
		return
	}

	fetchResultCh <- producerFetchResult{jobs: jobs}
}

That might be better in that not every pilot would need to remember to bring its own context timeout. That said, in this case we might want a longer timeout for the pro pilot, so it could make sense to keep the two separate.

bgentry · 2026-05-28T15:29:22Z

Another more robust option is to do fetches within a transaction that sets a statement_timeout at the DB level. This is far more robust because it's not prone to accidentally stranding jobs, although the extra round trips may impact throughput. As long as the Go side waits for db_timeout + margin it'd be super unlikely to strand jobs. We basically do this on the Pro side already due to extra logic running within fetches. Thoughts @brandur?

brandur · 2026-05-30T17:18:42Z

@bgentry WFM. It'd probably be worth re-running the benchmark on the branch to verify no major degradation in performance, but given the fetch queries are relatively few compared to everything else, hopefully there wouldn't be.

You were previously against use of statement_timeout with a context cancellation though weren't you? You're saying there'd still be both here right?

peter941221 · 2026-05-30T22:52:56Z

@bgentry @brandur
I put it on StandardPilot first because I wasn’t sure yet whether ProPilot should inherit the same timeout semantics.
If both pilots are really meant to have the same JobGetAvailable cancellation policy, then I agree dispatchWork is probably the cleaner home for it since it makes the timeout automatic instead of something each pilot has to remember.
My hesitation was exactly the case you called out: if ProPilot wants a meaningfully longer timeout, pushing it up immediately may be the wrong abstraction. I can take a closer look at the pro path and either move this up to dispatchWork or keep it split if the semantics are intentionally different.

brandur · 2026-05-31T03:53:45Z

@peter941221 Yep, makes sense. Thanks.

peter941221 · 2026-05-31T04:31:02Z

I’m going to keep this PR scoped to the StandardPilot timeout.
I don’t want to move the timeout up to dispatchWork or switch this branch to a transaction + statement_timeout design until I’ve checked whether the Pro fetch path is meant to keep different timeout semantics.
If we do want one policy across both pilots, I’d rather handle that as a separate follow-up with a benchmark around the extra round trips.

brandur · 2026-05-31T19:37:41Z

Thx. Gonna pull this in and make a few tweaks on top.

Follows up #1255 to add a `statement_timeout` in addition to the Go context timeout. `statement_timeout` gives us a better error message, and also minimizes the chance of accidentally locking rows that won't be worked if an operation ran long and succeeded, but was then immediately cancelled as Go's context timeout ran out.

`statement_timeout` is Postgres only, so the code is a little gnarlier than would be desirable: we add a `SetLocalStatementTimeout` function to the driver's `ExecutorTx`, which is a no-op on some databases like SQLite. We try to clarify in documentation that it needs to be used in addition to a context timeout, not instead of one, because it may no-op depending on the database.

I also increased the timeout to 30 seconds. This matches our timeouts in the various maintenance modules, and seems a little safer, as job locks on tables with huge numbers of dead rows could potentially take over 10 seconds, and maybe some users have this happening.

IMO, it's still too random where we put this stuff, but we'll have to figure that out in follow-up changes. E.g. why do we do a statement timeout for locking jobs but not for maintenance operations? Hard to justify.

peter941221 added 2 commits May 25, 2026 21:07

Add timeout to standard pilot fetch

8499d43

[skip ci]

Update changelog for standard pilot fetch timeout

9d5b603

[skip ci]

brandur reviewed May 26, 2026


Comment thread rivershared/riverpilot/standard_pilot.go Outdated

brandur reviewed May 26, 2026


Simplify standard pilot fetch timeout

5cc9e93

peter941221 marked this pull request as ready for review May 29, 2026 10:17

Merge branch 'master' into fix/standard-pilot-fetch-timeout

598d398

brandur merged commit 965dbad into riverqueue:master May 31, 2026
12 checks passed

brandur mentioned this pull request May 31, 2026

Add statement_timeout on fetching available jobs #1263

Open

brandur merged 4 commits into riverqueue:master from peter941221:fix/standard-pilot-fetch-timeout

3 participants
