[conversation_replay] force min_tokens == max_tokens for deterministic output length#497
Open
LoganVegnaSHOP wants to merge 1 commit into
Conversation
…c output length Conversation replay benchmarks compare model throughput by sampling a fixed output-length distribution per turn and asserting that each turn emits exactly that many tokens. In practice the sampled length is not honored: different model families terminate at different chat-template stop tokens, so the same workload yields different total output token counts on Qwen vs Llama vs Nemotron, which invalidates tok/sec and convos/hr comparisons. `ignore_eos=True` does not fix this. vLLM's `ignore_eos` only suppresses the single canonical `tokenizer.eos_token_id`; it does not suppress the rest of the ids in `generation_config.eos_token_id`. Qwen3's list is `[151645, 151643]` (`<|im_end|>`, `<|endoftext|>`); Llama3's is `[128001, 128008, 128009]`. Once the chat template stop token fires, generation stops regardless of `ignore_eos`. The fix is vLLM's `min_tokens` parameter, which is checked before any stop condition (EOS, stop_token_ids, custom stops). Setting `min_tokens == max_tokens` forces deterministic per-turn output length across every model family. Changes: - `apis/completion.py`: add `min_tokens: Optional[int] = None` to `CompletionAPIData`; include in request body when set. Forwarded by `UserSessionCompletionAPIData` via the existing `super().to_request_body()` call, so conversation_replay picks it up. - `datagen/conversation_replay_datagen.py`: set `min_tokens=bp.turn_output_lens[turn_idx]` on every emitted `_ConversationReplayAPIData`, parallel to the existing `max_tokens`. Tests: - `min_tokens` is omitted from the payload when unset (back-compat for callers that don't need it). - When set, it round-trips through `to_request_body` verbatim. - Every per-turn payload produced by `ConversationReplayDataGenerator` has `min_tokens == max_tokens == turn_output_lens[i]`. Design note: `min_tokens` is always set for conversation_replay with no config knob. The whole point of conversation_replay is reproducibility; making deterministic length opt-in defeats the purpose, and there is no scenario where a user wants the cross-model drift behavior back.
Contributor
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: LoganVegnaSHOP The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
LoganVegnaSHOP
added a commit
to LoganVegnaSHOP/inference-perf
that referenced
this pull request
May 20, 2026
…c output length Cherry-pick of upstream PR kubernetes-sigs#497 (kubernetes-sigs#497) ported to the fork's pre-rename API surface (`to_payload` instead of `to_request_body`). Drop this commit and rebase on top of upstream main once kubernetes-sigs#497 merges. vLLM's `ignore_eos` only suppresses the canonical `tokenizer.eos_token_id`, not the rest of `generation_config.eos_token_id`. Qwen3, Llama3, and Nemotron each have additional chat-template stop tokens (`<|im_end|>`, `<|eot_id|>`, etc.) that terminate generation early, so the same conversation_replay workload produces different total output token counts across models — invalidating cross-model tok/sec comparisons. Fix: set `min_tokens == max_tokens` per turn. vLLM's `min_tokens` is checked before any stop condition, so the model is forced to emit exactly the sampled `turn_output_lens[i]` regardless of which stop tokens its generation_config declares. - `apis/completion.py`: add `min_tokens: Optional[int] = None` to `CompletionAPIData`; include in payload when set. Propagated through `UserSessionCompletionAPIData` via its existing `super().to_payload()` call. - `datagen/conversation_replay_datagen.py`: set `min_tokens=bp.turn_output_lens[turn_idx]` parallel to `max_tokens` on every per-turn `_ConversationReplayAPIData`.
Bslabe123
reviewed
May 21, 2026
| @@ -39,14 +46,17 @@ async def to_request_body( | |||
| ) -> RequestBody: | |||
| if self.max_tokens == 0: | |||
Contributor
There was a problem hiding this comment.
style nit: colocate the min_tokens and max_tokens checks?
Contributor
|
Is this change limited to completions or should there also be a similar change in chat-completions? |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Conversation-replay benchmarks compare model throughput by sampling a fixed output-length distribution per turn (
turn_output_lens) and assuming each turn emits exactly that many tokens. In practice the sampled length is not honored: different model families terminate at different chat-template stop tokens, so the same workload yields different total output token counts on Qwen vs Llama vs Nemotron. This invalidatestok/secandconvos/hrcomparisons across models — TPOT remains valid, but the per-turn token budget does not.ignore_eos=Truedoes not fix this. vLLM'signore_eosonly suppresses the single canonicaltokenizer.eos_token_id; it does not suppress the rest of the ids ingeneration_config.eos_token_id. Examples:eos_token_id = [151645, 151643]—<|im_end|>,<|endoftext|>eos_token_id = [128001, 128008, 128009]Once a chat-template stop token fires, generation stops regardless of
ignore_eos. In live runs we've seen the same workload produce ~585 output tokens on one model and ~280 on another for the samemax_tokens/ignore_eos=Truesettings, purely because the second model's<|im_end|>slipped past the EOS filter.Fix
Use vLLM's
min_tokens, which is checked before any stop condition (EOS,stop_token_ids, custom stops). Settingmin_tokens == max_tokensforces deterministic per-turn output length across every model family.Changes
inference_perf/apis/completion.py— addmin_tokens: Optional[int] = NonetoCompletionAPIDataand include it in the request body when set.UserSessionCompletionAPIData.to_request_bodyalready callssuper().to_request_body(), so the field is propagated through to conversation_replay automatically.inference_perf/datagen/conversation_replay_datagen.py— setmin_tokens=bp.turn_output_lens[turn_idx]parallel to the existingmax_tokenswhen constructing each per-turn_ConversationReplayAPIData.Tests
min_tokensis omitted from the payload when unset (back-compat for the other API surfaces).to_request_bodyverbatim alongsidemax_tokens.ConversationReplayDataGeneratorhasmin_tokens == max_tokens == turn_output_lens[i]for the entire materialized stream.Full suite still green: 288/288.
Design note
min_tokensis always set for conversation_replay with no config knob. The point of conversation_replay is reproducibility; making deterministic length opt-in defeats the purpose, and there is no scenario where a user wants the cross-model drift behavior back. Other API types (CompletionAPIDataused outside conversation_replay) are unaffected becausemin_tokensdefaults toNone.Caveats
max_tokenswhenprompt_tokens + max_tokens > max_model_len. If the prompt grows that large during a session, vLLM may emit fewer thanmin_tokenstokens and error or truncate. Watcherror_rateand tuneoutput_tokens_per_turnagainst yourmax_model_len.min_tokensis a vLLM extension, not OpenAI-spec. Other OpenAI-compatible backends will ignore the field — the existing drift behavior persists there. This PR doesn't change behavior for non-vLLM servers.