[conversation_replay] force min_tokens == max_tokens for deterministic output length by LoganVegnaSHOP · Pull Request #497 · kubernetes-sigs/inference-perf

LoganVegnaSHOP · 2026-05-20T15:50:54Z

Problem

Conversation-replay benchmarks compare model throughput by sampling a fixed output-length distribution per turn (turn_output_lens) and assuming each turn emits exactly that many tokens. In practice the sampled length is not honored: different model families terminate at different chat-template stop tokens, so the same workload yields different total output token counts on Qwen vs Llama vs Nemotron. This invalidates tok/sec and convos/hr comparisons across models — TPOT remains valid, but the per-turn token budget does not.

ignore_eos=True does not fix this. vLLM's ignore_eos only suppresses the single canonical tokenizer.eos_token_id; it does not suppress the rest of the ids in generation_config.eos_token_id. Examples:

Qwen3: eos_token_id = [151645, 151643] — <|im_end|>, <|endoftext|>
Llama3: eos_token_id = [128001, 128008, 128009]

Once a chat-template stop token fires, generation stops regardless of ignore_eos. In live runs we've seen the same workload produce ~585 output tokens on one model and ~280 on another for the same max_tokens/ignore_eos=True settings, purely because the second model's <|im_end|> slipped past the EOS filter.

Fix

Use vLLM's min_tokens, which is checked before any stop condition (EOS, stop_token_ids, custom stops). Setting min_tokens == max_tokens forces deterministic per-turn output length across every model family.
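The difference between the two settings can be illustrated with a toy decode loop. This is only a sketch of the behavior described above, not vLLM's actual implementation: `toy_generate`, the `FILLER` token, and the small token ids are all hypothetical, and the "model" is just a fixed list of tokens it would like to emit.

```python
from typing import List, Sequence, Set

FILLER = 0  # hypothetical token emitted when a stop token is masked out


def toy_generate(
    wanted: Sequence[int],
    eos_token_id: int,
    stop_token_ids: Set[int],
    max_tokens: int,
    ignore_eos: bool = False,
    min_tokens: int = 0,
) -> List[int]:
    """Toy decode loop: emit tokens until a stop condition or max_tokens."""
    out: List[int] = []
    for tok in wanted:
        if len(out) >= max_tokens:
            break
        # ignore_eos only disarms the single canonical EOS id;
        # the extra stop ids still terminate generation.
        is_stop = tok in stop_token_ids or (tok == eos_token_id and not ignore_eos)
        if is_stop and len(out) < min_tokens:
            tok = FILLER  # below min_tokens, stop tokens cannot be chosen
        elif is_stop:
            break
        out.append(tok)
    return out


# Hypothetical ids: 1 is the canonical EOS, 2 is an extra chat-template stop.
wanted = [5, 6, 7, 2, 8, 9, 1, 10, 11, 12]
short = toy_generate(wanted, 1, {2}, max_tokens=8, ignore_eos=True)
fixed = toy_generate(wanted, 1, {2}, max_tokens=8, ignore_eos=True, min_tokens=8)
print(len(short), len(fixed))
```

With `ignore_eos=True` alone, the extra stop id ends the turn after three tokens; with `min_tokens == max_tokens` the turn runs to the full eight.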

Changes

inference_perf/apis/completion.py — add min_tokens: Optional[int] = None to CompletionAPIData and include it in the request body when set. UserSessionCompletionAPIData.to_request_body already calls super().to_request_body(), so the field is propagated through to conversation_replay automatically.
inference_perf/datagen/conversation_replay_datagen.py — set min_tokens=bp.turn_output_lens[turn_idx] parallel to the existing max_tokens when constructing each per-turn _ConversationReplayAPIData.
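As a rough sketch of the first change, the optional field might look like the following. `CompletionData` is a simplified, hypothetical stand-in for `CompletionAPIData` (the real method is async and returns a `RequestBody`); only the `min_tokens: Optional[int] = None` field and the "include when set" rule come from the description above.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class CompletionData:
    """Hypothetical, simplified stand-in for CompletionAPIData."""

    prompt: str
    max_tokens: int = 0
    min_tokens: Optional[int] = None  # new field; None keeps old payloads unchanged

    def to_request_body(self, model: str) -> Dict[str, Any]:
        body: Dict[str, Any] = {
            "model": model,
            "prompt": self.prompt,
            "max_tokens": self.max_tokens,
        }
        # Only send min_tokens when the caller asked for it.
        if self.min_tokens is not None:
            body["min_tokens"] = self.min_tokens
        return body


legacy = CompletionData(prompt="hi", max_tokens=64).to_request_body("m")
pinned = CompletionData(prompt="hi", max_tokens=64, min_tokens=64).to_request_body("m")
```

Because the key is absent rather than `null` when unset, callers outside conversation_replay send the same payload as before.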

Tests

min_tokens is omitted from the payload when unset (back-compat for the other API surfaces).
When set, it round-trips through to_request_body verbatim alongside max_tokens.
Every per-turn payload produced by ConversationReplayDataGenerator has min_tokens == max_tokens == turn_output_lens[i] for the entire materialized stream.

Full suite still green: 288/288.
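The third test's invariant can be sketched in isolation. The helpers below are hypothetical and build plain dicts rather than going through `ConversationReplayDataGenerator`; they only illustrate the `min_tokens == max_tokens == turn_output_lens[i]` check.

```python
from typing import Any, Dict, List


def build_turn_payloads(
    turn_prompts: List[str], turn_output_lens: List[int]
) -> List[Dict[str, Any]]:
    """Hypothetical helper: one payload per turn with a pinned output length."""
    if len(turn_prompts) != len(turn_output_lens):
        raise ValueError("need one output length per turn")
    return [
        {"prompt": prompt, "max_tokens": n, "min_tokens": n}
        for prompt, n in zip(turn_prompts, turn_output_lens)
    ]


def check_fixed_length(
    payloads: List[Dict[str, Any]], turn_output_lens: List[int]
) -> None:
    """Assert every turn pins min_tokens and max_tokens to the sampled length."""
    assert len(payloads) == len(turn_output_lens)
    for i, (payload, expected) in enumerate(zip(payloads, turn_output_lens)):
        assert payload["min_tokens"] == payload["max_tokens"] == expected, f"turn {i}"


lens = [128, 64, 256]
payloads = build_turn_payloads(["q1", "q2", "q3"], lens)
check_fixed_length(payloads, lens)
```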

Design note

min_tokens is always set for conversation_replay with no config knob. The point of conversation_replay is reproducibility; making deterministic length opt-in defeats the purpose, and there is no scenario where a user wants the cross-model drift behavior back. Other API types (CompletionAPIData used outside conversation_replay) are unaffected because min_tokens defaults to None.

Caveats

vLLM still clamps max_tokens when prompt_tokens + max_tokens > max_model_len. If the prompt grows that large during a session, the request may fail or vLLM may emit fewer than min_tokens tokens. Watch error_rate and tune output_tokens_per_turn against your max_model_len.
min_tokens is a vLLM extension, not OpenAI-spec. Other OpenAI-compatible backends will ignore the field — the existing drift behavior persists there. This PR doesn't change behavior for non-vLLM servers.
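One way to anticipate the first caveat is to check the token budget before running a session. The sketch below assumes each turn's prompt carries the full prior history (earlier inputs plus earlier outputs); `first_overflowing_turn` and its parameters are hypothetical names, not part of inference-perf.

```python
from typing import List, Optional


def first_overflowing_turn(
    system_prompt_tokens: int,
    turn_input_lens: List[int],
    turn_output_lens: List[int],
    max_model_len: int,
) -> Optional[int]:
    """Return the first turn where prompt_tokens + max_tokens > max_model_len."""
    context = system_prompt_tokens
    for i, (inp, out) in enumerate(zip(turn_input_lens, turn_output_lens)):
        prompt_tokens = context + inp
        if prompt_tokens + out > max_model_len:
            return i
        # Assumption: the whole exchange is replayed as history next turn.
        context = prompt_tokens + out
    return None


tight = first_overflowing_turn(100, [50, 50, 50], [200, 200, 200], max_model_len=600)
roomy = first_overflowing_turn(100, [50, 50, 50], [200, 200, 200], max_model_len=2000)
print(tight, roomy)  # 2 None
```

A non-`None` result suggests lowering the per-turn output lengths or the number of turns for that `max_model_len`.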

…c output length

Conversation replay benchmarks compare model throughput by sampling a fixed output-length distribution per turn and asserting that each turn emits exactly that many tokens. In practice the sampled length is not honored: different model families terminate at different chat-template stop tokens, so the same workload yields different total output token counts on Qwen vs Llama vs Nemotron, which invalidates tok/sec and convos/hr comparisons.

`ignore_eos=True` does not fix this. vLLM's `ignore_eos` only suppresses the single canonical `tokenizer.eos_token_id`; it does not suppress the rest of the ids in `generation_config.eos_token_id`. Qwen3's list is `[151645, 151643]` (`<|im_end|>`, `<|endoftext|>`); Llama3's is `[128001, 128008, 128009]`. Once the chat template stop token fires, generation stops regardless of `ignore_eos`.

The fix is vLLM's `min_tokens` parameter, which is checked before any stop condition (EOS, stop_token_ids, custom stops). Setting `min_tokens == max_tokens` forces deterministic per-turn output length across every model family.

Changes:
- `apis/completion.py`: add `min_tokens: Optional[int] = None` to `CompletionAPIData`; include in request body when set. Forwarded by `UserSessionCompletionAPIData` via the existing `super().to_request_body()` call, so conversation_replay picks it up.
- `datagen/conversation_replay_datagen.py`: set `min_tokens=bp.turn_output_lens[turn_idx]` on every emitted `_ConversationReplayAPIData`, parallel to the existing `max_tokens`.

Tests:
- `min_tokens` is omitted from the payload when unset (back-compat for callers that don't need it).
- When set, it round-trips through `to_request_body` verbatim.
- Every per-turn payload produced by `ConversationReplayDataGenerator` has `min_tokens == max_tokens == turn_output_lens[i]`.

Design note: `min_tokens` is always set for conversation_replay with no config knob. The whole point of conversation_replay is reproducibility; making deterministic length opt-in defeats the purpose, and there is no scenario where a user wants the cross-model drift behavior back.

k8s-ci-robot · 2026-05-20T15:51:04Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LoganVegnaSHOP
Once this PR has been reviewed and has the lgtm label, please assign arangogutierrez for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…c output length

Cherry-pick of upstream PR kubernetes-sigs#497 (kubernetes-sigs#497) ported to the fork's pre-rename API surface (`to_payload` instead of `to_request_body`). Drop this commit and rebase on top of upstream main once kubernetes-sigs#497 merges.

vLLM's `ignore_eos` only suppresses the canonical `tokenizer.eos_token_id`, not the rest of `generation_config.eos_token_id`. Qwen3, Llama3, and Nemotron each have additional chat-template stop tokens (`<|im_end|>`, `<|eot_id|>`, etc.) that terminate generation early, so the same conversation_replay workload produces different total output token counts across models, invalidating cross-model tok/sec comparisons.

Fix: set `min_tokens == max_tokens` per turn. vLLM's `min_tokens` is checked before any stop condition, so the model is forced to emit exactly the sampled `turn_output_lens[i]` regardless of which stop tokens its generation_config declares.

- `apis/completion.py`: add `min_tokens: Optional[int] = None` to `CompletionAPIData`; include in payload when set. Propagated through `UserSessionCompletionAPIData` via its existing `super().to_payload()` call.
- `datagen/conversation_replay_datagen.py`: set `min_tokens=bp.turn_output_lens[turn_idx]` parallel to `max_tokens` on every per-turn `_ConversationReplayAPIData`.

Bslabe123 · 2026-05-21T16:40:50Z

@@ -39,14 +46,17 @@ async def to_request_body(
    ) -> RequestBody:
        if self.max_tokens == 0:


style nit: colocate the min_tokens and max_tokens checks?

Bslabe123 · 2026-05-21T16:45:33Z

Is this change limited to completions or should there also be a similar change in chat-completions?

k8s-ci-robot requested review from Bslabe123 and terrytangyuan May 20, 2026 15:51

k8s-ci-robot added labels May 20, 2026: cncf-cla: yes (indicates the PR's author has signed the CNCF CLA), size/M (denotes a PR that changes 30-99 lines, ignoring generated files)

Bslabe123 reviewed May 21, 2026

LoganVegnaSHOP wants to merge 1 commit into kubernetes-sigs:main from LoganVegnaSHOP:feat/min-tokens-deterministic-output
