Fix AppendNextTokensToSequences heap overflow by apsonawane · Pull Request #2111 · microsoft/onnxruntime-genai

apsonawane · 2026-04-30T20:54:55Z

This pull request adds robust bounds checking to prevent writing past the allocated sequence buffer during token generation, both on CPU and CUDA backends. It ensures that sequence appending operations do not exceed the configured max_length, and introduces early error handling and logging when the maximum length is reached. These changes improve the safety and stability of sequence generation, preventing out-of-bounds writes and making debugging easier.

Sequence bounds checking and safety:

Added explicit checks in GreedySearch_Cuda::SampleTopKTopP, GreedySearch_Cuda::AppendTokens, GreedySearch_Cpu::AppendNextTokensToSequences, and BeamSearch_Cpu::AppendNextTokensToSequences to ensure tokens are only appended if the current sequence length is less than max_length, preventing buffer overflows.
Updated conditions to use >= instead of == when checking if the sequence has reached max_length, ensuring no further tokens are appended once the limit is reached.
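
The guard described above can be illustrated with a minimal sketch. `ToySequences` and its members are hypothetical stand-ins written for this example, not the actual onnxruntime-genai classes; the sketch only assumes a preallocated buffer of batch_size * max_length tokens and a length counter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a preallocated sequence buffer.
// This is NOT the real onnxruntime-genai Sequences class.
class ToySequences {
 public:
  ToySequences(int batch_size, int max_length)
      : batch_size_(batch_size),
        max_length_(max_length),
        buffer_(static_cast<std::size_t>(batch_size) * max_length, 0) {}

  int GetSequenceLength() const { return current_length_; }

  // Appends one token per batch entry. Returns false and leaves the buffer
  // untouched when the buffer is already full.
  bool AppendNextTokens(const std::vector<std::int32_t>& next_tokens) {
    assert(next_tokens.size() == static_cast<std::size_t>(batch_size_));
    // Using >= rather than == also rejects a length that has somehow moved
    // past the limit, instead of only the exact boundary value.
    if (current_length_ >= max_length_) return false;
    for (int i = 0; i < batch_size_; ++i) {
      buffer_[static_cast<std::size_t>(i) * max_length_ + current_length_] =
          next_tokens[i];
    }
    ++current_length_;
    return true;
  }

 private:
  int batch_size_;
  int max_length_;
  int current_length_ = 0;
  std::vector<std::int32_t> buffer_;
};
```

With batch_size 2 and max_length 3, the first three appends succeed and the fourth is rejected without touching memory.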

Error handling and logging:

In Generator::GenerateNextToken, added an early runtime error if called when the sequence is already at max_length, providing a clear message for debugging.
Added logging for when the maximum sequence length is hit, making it easier to trace and debug issues related to buffer limits.
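
A fail-fast guard of this kind might look roughly like the sketch below. `ToyGenerator`, its fields, and the error message text are assumptions made for illustration; the actual message and call structure in Generator::GenerateNextToken may differ.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical, simplified generator used only to illustrate the early check.
struct ToyGenerator {
  int sequence_length = 0;
  int max_length = 0;

  void GenerateNextToken() {
    // Refuse to run when the buffer is already full, before any sampling or
    // append step has a chance to write past the end.
    if (sequence_length >= max_length) {
      throw std::runtime_error(
          "GenerateNextToken called at max_length (" +
          std::to_string(max_length) + "); no room for another token");
    }
    ++sequence_length;  // stands in for sampling and appending one token
  }
};
```

Throwing before any work is done keeps the failure at the call site, so the caller sees a clear message instead of a later memory error.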

State management improvements:

Ensured that the done_ state is only reset if the sequence buffer is not full, preserving the correct state and preventing accidental out-of-bounds writes on subsequent calls.
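
As a rough sketch of the state rule above, assuming a hypothetical `ToySearchState` with a length counter and a done flag (these names are illustrative and not taken from the PR diff):

```cpp
// Hypothetical search state; the real classes keep more information.
struct ToySearchState {
  int sequence_length = 0;
  int max_length = 0;
  bool done = false;

  // Clears the done flag only while there is still room in the buffer.
  // When the buffer is full, done stays set so later calls do not append.
  void ResetDone() {
    if (sequence_length < max_length) {
      done = false;
    }
  }
};
```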

Copilot

Pull request overview

This PR aims to prevent out-of-bounds writes into the preallocated sequence buffers during generation by adding explicit max_length bounds checks across CPU and CUDA search implementations, and by failing fast when GenerateNextToken() is called after reaching max_length.

Changes:

Added early-return bounds checks in CPU greedy/beam append paths to prevent writing past the sequences buffer.
Added CUDA-side guards to avoid launching append kernels once the sequence length reaches max_length, and updated done/max-length checks.
Added an early GenerateNextToken() runtime error when sequence length is already at max_length.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File	Description
`src/search.cpp`	Adds CPU-side bounds checks and adjusts done/reset behavior around appending tokens.
`src/generators.cpp`	Adds a fast-fail guard in `GenerateNextToken()` when already at `max_length`.
`src/cuda/search_cuda.cpp`	Adds CUDA-side conditional guards to avoid appending once at/over `max_length`, plus logging/done handling.

kunal-vaishnavi · 2026-05-12T17:16:19Z

The changes in this PR seem to indicate a larger issue inside ORT GenAI. The done state should reflect the true state, and we should not have to insert max length checks in so many locations. Do we know why this issue is happening in the first place?

baijumeswani · 2026-05-20T06:48:26Z

  }

-  if (sequences_.GetSequenceLength() == params_->search.max_length) {
+  if (sequences_.GetSequenceLength() >= params_->search.max_length) {


GetSequenceLength starts at 0 and increases as token count increases. max_length is expected to be non zero. We have checks in place in generators.cpp to ensure that the tokens appended do not exceed max_length.

I am not sure I see where the heap overflow may occur. Perhaps we need to document where the checks are happening so we have a good understanding. But I do not think there is a need for this change.

Was there a heap-overflow encountered at runtime?

apsonawane · 2026-05-20

The heap overflow is caused by the following sequence:

AppendTokens() fills the buffer to max_length, then ResetDone() clears done_.

The next GenerateNextToken() -> SampleTopk() passes the !done_ check -> AppendNextTokensToSequences() writes at index max_length -> out-of-bounds write.
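
The sequence in this reply can be traced with a small model. The sketch below does not perform any real out-of-bounds write; it only computes the index the append step would use. `Trace` and `SimulateUnguardedFlow` are hypothetical names, and the assumption that filling the buffer sets the done flag is an inference from the thread, not something confirmed by the diff.

```cpp
// Result of tracing one pass through the unguarded flow.
struct Trace {
  bool out_of_bounds;  // true if the write index falls outside [0, max_length)
  int write_index;     // per-sequence index the append step would write to
};

// Toy model of the pre-fix call order described in the reply above.
Trace SimulateUnguardedFlow(int max_length) {
  int length = 0;
  bool done = false;

  // Step 1: AppendTokens() fills the buffer up to max_length.
  length = max_length;
  done = true;  // assumption: hitting max_length marks the search as done

  // Step 2: ResetDone() clears the flag unconditionally.
  done = false;

  // Step 3: GenerateNextToken() -> sampling sees !done and appends.
  if (!done) {
    int index = length;  // next slot is one past the last valid index
    return Trace{index >= max_length, index};
  }
  return Trace{false, -1};
}
```

For max_length 8 the model reports a write at index 8, one past the last valid slot (7), which matches the out-of-bounds write described in the reply.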

Fix AppendNextTokensToSequences heap overflow

38594fa

Copilot AI review requested due to automatic review settings April 30, 2026 20:54

Copilot started reviewing on behalf of apsonawane April 30, 2026 20:55

Copilot AI reviewed Apr 30, 2026

Comment thread src/generators.cpp Outdated

Comment thread src/search.cpp Outdated

apsonawane added 2 commits April 30, 2026 15:01

address comments

a34a7f9

Merge branch 'main' into asonawane/heap

e8d3c72

apsonawane requested a review from a team as a code owner May 5, 2026 19:05

baijumeswani reviewed May 20, 2026

apsonawane added 3 commits May 20, 2026 13:13

Merge branch 'main' into asonawane/heap

e571280

Cleanup

465f01f

Fix

e7cde34

apsonawane enabled auto-merge (squash) May 20, 2026 21:45

apsonawane wants to merge 6 commits into main from asonawane/heap
4 participants