Fix AppendNextTokensToSequences heap overflow #2111
Pull request overview
This PR aims to prevent out-of-bounds writes into the preallocated sequence buffers during generation by adding explicit max_length bounds checks across CPU and CUDA search implementations, and by failing fast when GenerateNextToken() is called after reaching max_length.
Changes:
- Added early-return bounds checks in the CPU greedy/beam append paths to prevent writing past the sequences buffer.
- Added CUDA-side guards to avoid launching append kernels once the sequence length reaches `max_length`, and updated done/max-length checks.
- Added an early `GenerateNextToken()` runtime error when the sequence length is already at `max_length`.
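The changes listed above can be illustrated with a minimal sketch. It is not the ORT GenAI implementation: `ToySequences`, `TryAppend`, and `GenerateNextTokenOrThrow` are hypothetical names, and a plain `std::vector` stands in for the preallocated sequence buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a sequence buffer preallocated to max_length tokens.
struct ToySequences {
  std::vector<int32_t> buffer;  // allocated once, never resized
  std::size_t length = 0;       // number of tokens written so far

  explicit ToySequences(std::size_t max_length) : buffer(max_length) {}

  std::size_t MaxLength() const { return buffer.size(); }

  // Early-return bounds check: refuse to write once the buffer is full.
  bool TryAppend(int32_t token) {
    if (length >= MaxLength()) return false;  // would write at index max_length
    buffer[length++] = token;
    return true;
  }
};

// Fail-fast guard: report the problem to the caller instead of silently
// skipping the append or writing out of bounds.
inline void GenerateNextTokenOrThrow(ToySequences& seq, int32_t token) {
  if (seq.length >= seq.MaxLength())
    throw std::runtime_error("sequence length is already at max_length");
  const bool appended = seq.TryAppend(token);
  assert(appended);  // cannot fail after the check above
  (void)appended;
}
```

The sketch keeps both layers on purpose: the throw gives the caller a clear error, while the check inside `TryAppend` still protects the buffer if some other path reaches the append.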
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| src/search.cpp | Adds CPU-side bounds checks and adjusts done/reset behavior around appending tokens. |
| src/generators.cpp | Adds a fast-fail guard in `GenerateNextToken()` when already at `max_length`. |
| src/cuda/search_cuda.cpp | Adds CUDA-side conditional guards to avoid appending once at/over `max_length`, plus logging/done handling. |
The changes in this PR seem to indicate a larger issue inside ORT GenAI. The

    }

    - if (sequences_.GetSequenceLength() == params_->search.max_length) {
    + if (sequences_.GetSequenceLength() >= params_->search.max_length) {

`GetSequenceLength()` starts at 0 and increases as tokens are appended. `max_length` is expected to be non-zero. We already have checks in generators.cpp to ensure that the appended tokens do not exceed `max_length`.
I am not sure I see where the heap overflow could occur. Perhaps we need to document where the checks happen so that we have a good understanding, but I do not think this change is needed.
Was a heap overflow actually encountered at runtime?
The heap overflow is caused by this sequence:
- `AppendTokens()` fills the buffer to `max_length`, then `ResetDone()` clears `done_`.
- The next `GenerateNextToken()` -> `SampleTopk()` -> `!done_` passes -> `AppendNextTokensToSequences()` writes at index `max_length` -> OOB write.
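The sequence described above can be traced with a small model. This is a sketch under assumptions, not the project's code: `ToyGreedySearch` and `NextAppendWouldOverflow` are hypothetical, the model only borrows the method names from the comment, and it assumes that filling the buffer marks the search as done. It reports whether the next write would land past the buffer instead of performing the out-of-bounds write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of the search state described in the comment above.
struct ToyGreedySearch {
  std::vector<int32_t> sequence;  // preallocated to max_length
  std::size_t length = 0;         // tokens written so far
  bool done = false;

  explicit ToyGreedySearch(std::size_t max_length) : sequence(max_length) {}

  // Copies caller-supplied tokens into the buffer.
  // Assumption for illustration: a full buffer marks the search as done.
  void AppendTokens(const std::vector<int32_t>& tokens) {
    assert(length + tokens.size() <= sequence.size());
    for (int32_t t : tokens) sequence[length++] = t;
    if (length == sequence.size()) done = true;
  }

  // Unguarded reset, modelling the pre-fix behaviour described above.
  void ResetDone() { done = false; }

  // True when the next sampling step would pass the !done check and then
  // write at index `length`, which is one past the last valid slot.
  bool NextAppendWouldOverflow() const {
    return !done && length >= sequence.size();
  }
};
```

In this model, calling `AppendTokens` with exactly `max_length` tokens followed by `ResetDone()` leaves `done` cleared with a full buffer, which is the state in which the next append targets index `max_length`.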
This pull request adds robust bounds checking to prevent writing past the allocated sequence buffer during token generation, on both the CPU and CUDA backends. It ensures that sequence-appending operations do not exceed the configured `max_length`, and introduces early error handling and logging when the maximum length is reached. These changes improve the safety and stability of sequence generation, preventing out-of-bounds writes and making debugging easier.

Sequence bounds checking and safety:
- Added checks in `GreedySearch_Cuda::SampleTopKTopP`, `GreedySearch_Cuda::AppendTokens`, `GreedySearch_Cpu::AppendNextTokensToSequences`, and `BeamSearch_Cpu::AppendNextTokensToSequences` to ensure tokens are only appended if the current sequence length is less than `max_length`, preventing buffer overflows.
- Changed the comparison to `>=` instead of `==` when checking whether the sequence has reached `max_length`, ensuring no further tokens are appended once the limit is reached.

Error handling and logging:
- In `Generator::GenerateNextToken`, added an early runtime error if called when the sequence is already at `max_length`, providing a clear message for debugging.

State management improvements:
- The `done_` state is only reset if the sequence buffer is not full, preserving the correct state and preventing accidental out-of-bounds writes on subsequent calls.