
[conversation_replay] force min_tokens == max_tokens for deterministic output length#497

Open
LoganVegnaSHOP wants to merge 1 commit into kubernetes-sigs:main from LoganVegnaSHOP:feat/min-tokens-deterministic-output

Conversation

@LoganVegnaSHOP

Problem

Conversation-replay benchmarks compare model throughput by sampling a fixed output-length distribution per turn (turn_output_lens) and assuming each turn emits exactly that many tokens. In practice the sampled length is not honored: different model families terminate at different chat-template stop tokens, so the same workload yields different total output token counts on Qwen vs Llama vs Nemotron. This invalidates tok/sec and convos/hr comparisons across models — TPOT remains valid, but the per-turn token budget does not.

ignore_eos=True does not fix this. vLLM's ignore_eos only suppresses the single canonical tokenizer.eos_token_id; it does not suppress the rest of the ids in generation_config.eos_token_id. Examples:

  • Qwen3: eos_token_id = [151645, 151643] (<|im_end|>, <|endoftext|>)
  • Llama3: eos_token_id = [128001, 128008, 128009]

Once a chat-template stop token fires, generation stops regardless of ignore_eos. In live runs we've seen the same workload produce ~585 output tokens on one model and ~280 on another for the same max_tokens/ignore_eos=True settings, purely because the second model's <|im_end|> slipped past the EOS filter.

Fix

Use vLLM's min_tokens, which is checked before any stop condition (EOS, stop_token_ids, custom stops). Setting min_tokens == max_tokens forces deterministic per-turn output length across every model family.
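
The interaction between max_tokens, ignore_eos, and min_tokens described above can be sketched with a toy decode loop. This is a simplified model for illustration only, not vLLM's implementation: `generated_length`, `primary_eos`, and `extra_stop_ids` are hypothetical names, and the token ids are made up.

```python
from typing import Iterable, Set


def generated_length(
    stream: Iterable[int],
    max_tokens: int,
    primary_eos: int,
    extra_stop_ids: Set[int],
    ignore_eos: bool = False,
    min_tokens: int = 0,
) -> int:
    """Count the tokens a toy decode loop emits before it stops."""
    emitted = 0
    for token in stream:
        emitted += 1
        # The length cap always applies.
        if emitted >= max_tokens:
            break
        # Below min_tokens, no stop condition is allowed to fire.
        if emitted < min_tokens:
            continue
        # ignore_eos masks only the single primary EOS id.
        if token == primary_eos and not ignore_eos:
            break
        # Any other stop id still ends generation.
        if token in extra_stop_ids:
            break
    return emitted


# Made-up stream: id 7 stands in for an extra chat-template stop token.
tokens = [5, 5, 7, 5, 5, 5, 5, 5]
early = generated_length(tokens, 6, primary_eos=2, extra_stop_ids={7}, ignore_eos=True)
pinned = generated_length(
    tokens, 6, primary_eos=2, extra_stop_ids={7}, ignore_eos=True, min_tokens=6
)
```

In this toy model `early` stops at the extra stop id even with ignore_eos set, while `pinned` runs to the full max_tokens because the minimum is checked first.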

Changes

  • inference_perf/apis/completion.py — add min_tokens: Optional[int] = None to CompletionAPIData and include it in the request body when set. UserSessionCompletionAPIData.to_request_body already calls super().to_request_body(), so the field is propagated through to conversation_replay automatically.
  • inference_perf/datagen/conversation_replay_datagen.py — set min_tokens=bp.turn_output_lens[turn_idx] parallel to the existing max_tokens when constructing each per-turn _ConversationReplayAPIData.
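
The shape of the completion.py change can be sketched as follows. `CompletionSketch` is a hypothetical, simplified stand-in for CompletionAPIData: the real method is async and takes other arguments, and the `model` parameter and payload keys other than max_tokens and min_tokens are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class CompletionSketch:
    """Hypothetical, simplified stand-in for CompletionAPIData."""

    prompt: str
    max_tokens: int
    min_tokens: Optional[int] = None  # None keeps the field out of the payload

    def to_request_body(self, model: str) -> Dict[str, Any]:
        body: Dict[str, Any] = {
            "model": model,
            "prompt": self.prompt,
            "max_tokens": self.max_tokens,
        }
        # Only send min_tokens when the caller set it, so callers that
        # never set it keep producing the same payload as before.
        if self.min_tokens is not None:
            body["min_tokens"] = self.min_tokens
        return body


unset_body = CompletionSketch("hello", max_tokens=8).to_request_body("m")
pinned_body = CompletionSketch("hello", max_tokens=8, min_tokens=8).to_request_body("m")
```

Defaulting to None rather than 0 is what keeps the field absent for callers outside conversation_replay.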

Tests

  • min_tokens is omitted from the payload when unset (back-compat for the other API surfaces).
  • When set, it round-trips through to_request_body verbatim alongside max_tokens.
  • Every per-turn payload produced by ConversationReplayDataGenerator has min_tokens == max_tokens == turn_output_lens[i] for the entire materialized stream.
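
The per-turn invariant in the last test can be sketched in isolation. The helpers below are hypothetical and use plain dicts in place of _ConversationReplayAPIData; they are not the PR's test code.

```python
from typing import Dict, List


def build_turn_payloads(turn_output_lens: List[int]) -> List[Dict[str, int]]:
    """Pin both bounds of every turn to its sampled output length."""
    return [{"max_tokens": n, "min_tokens": n} for n in turn_output_lens]


def lengths_are_deterministic(
    payloads: List[Dict[str, int]], turn_output_lens: List[int]
) -> bool:
    """True when every turn has min_tokens == max_tokens == sampled length."""
    return len(payloads) == len(turn_output_lens) and all(
        p["min_tokens"] == p["max_tokens"] == n
        for p, n in zip(payloads, turn_output_lens)
    )


sampled = [64, 128, 32]  # made-up per-turn output lengths
payloads = build_turn_payloads(sampled)
```

Checking the whole stream, rather than only the first turn, guards against a turn index being off by one when the lengths are looked up.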

Full suite still green: 288/288.

Design note

min_tokens is always set for conversation_replay with no config knob. The point of conversation_replay is reproducibility; making deterministic length opt-in defeats the purpose, and there is no scenario where a user wants the cross-model drift behavior back. Other API types (CompletionAPIData used outside conversation_replay) are unaffected because min_tokens defaults to None.

Caveats

  • vLLM still clamps max_tokens when prompt_tokens + max_tokens > max_model_len. If the prompt grows that large during a session, vLLM may return an error or truncate the output below min_tokens. Watch error_rate and tune output_tokens_per_turn against your max_model_len.
  • min_tokens is a vLLM extension, not OpenAI-spec. Other OpenAI-compatible backends will ignore the field — the existing drift behavior persists there. This PR doesn't change behavior for non-vLLM servers.
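
One way to catch the first caveat before a run is to check each turn's token budget up front. This is a minimal sketch under assumptions: `first_overflowing_turn` is a hypothetical helper, and it assumes the per-turn prompt token counts are already known, which the benchmark may not expose in this form.

```python
from typing import List, Optional


def first_overflowing_turn(
    prompt_tokens_per_turn: List[int],
    turn_output_lens: List[int],
    max_model_len: int,
) -> Optional[int]:
    """Return the index of the first turn whose budget exceeds the context.

    A turn overflows when prompt_tokens + max_tokens > max_model_len,
    which is the condition named in the caveat above.
    """
    for idx, (prompt_tokens, out_len) in enumerate(
        zip(prompt_tokens_per_turn, turn_output_lens)
    ):
        if prompt_tokens + out_len > max_model_len:
            return idx
    return None


# Made-up session: the prompt grows each turn as history accumulates.
overflow_at = first_overflowing_turn([100, 300, 700], [200, 200, 400], 1024)
```

A result of None means every turn fits; any other value is the turn at which the fixed output length can no longer be honored.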

…c output length

Conversation replay benchmarks compare model throughput by sampling a fixed
output-length distribution per turn and asserting that each turn emits exactly
that many tokens. In practice the sampled length is not honored: different model
families terminate at different chat-template stop tokens, so the same workload
yields different total output token counts on Qwen vs Llama vs Nemotron, which
invalidates tok/sec and convos/hr comparisons.

`ignore_eos=True` does not fix this. vLLM's `ignore_eos` only suppresses the
single canonical `tokenizer.eos_token_id`; it does not suppress the rest of the
ids in `generation_config.eos_token_id`. Qwen3's list is `[151645, 151643]`
(`<|im_end|>`, `<|endoftext|>`); Llama3's is `[128001, 128008, 128009]`. Once
the chat template stop token fires, generation stops regardless of `ignore_eos`.

The fix is vLLM's `min_tokens` parameter, which is checked before any stop
condition (EOS, stop_token_ids, custom stops). Setting `min_tokens == max_tokens`
forces deterministic per-turn output length across every model family.

Changes:
- `apis/completion.py`: add `min_tokens: Optional[int] = None` to
  `CompletionAPIData`; include in request body when set. Forwarded by
  `UserSessionCompletionAPIData` via the existing `super().to_request_body()`
  call, so conversation_replay picks it up.
- `datagen/conversation_replay_datagen.py`: set
  `min_tokens=bp.turn_output_lens[turn_idx]` on every emitted
  `_ConversationReplayAPIData`, parallel to the existing `max_tokens`.

Tests:
- `min_tokens` is omitted from the payload when unset (back-compat for callers
  that don't need it).
- When set, it round-trips through `to_request_body` verbatim.
- Every per-turn payload produced by `ConversationReplayDataGenerator` has
  `min_tokens == max_tokens == turn_output_lens[i]`.

Design note: `min_tokens` is always set for conversation_replay with no config
knob. The whole point of conversation_replay is reproducibility; making
deterministic length opt-in defeats the purpose, and there is no scenario where
a user wants the cross-model drift behavior back.
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: LoganVegnaSHOP
Once this PR has been reviewed and has the lgtm label, please assign arangogutierrez for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the labels cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files.) on May 20, 2026
LoganVegnaSHOP added a commit to LoganVegnaSHOP/inference-perf that referenced this pull request May 20, 2026
…c output length

Cherry-pick of upstream PR kubernetes-sigs#497, ported to
the fork's pre-rename API surface (`to_payload` instead of `to_request_body`).
Drop this commit and rebase on top of upstream main once kubernetes-sigs#497 merges.

vLLM's `ignore_eos` only suppresses the canonical `tokenizer.eos_token_id`,
not the rest of `generation_config.eos_token_id`. Qwen3, Llama3, and Nemotron
each have additional chat-template stop tokens (`<|im_end|>`, `<|eot_id|>`,
etc.) that terminate generation early, so the same conversation_replay workload
produces different total output token counts across models — invalidating
cross-model tok/sec comparisons.

Fix: set `min_tokens == max_tokens` per turn. vLLM's `min_tokens` is checked
before any stop condition, so the model is forced to emit exactly the sampled
`turn_output_lens[i]` regardless of which stop tokens its generation_config
declares.

- `apis/completion.py`: add `min_tokens: Optional[int] = None` to
  `CompletionAPIData`; include in payload when set. Propagated through
  `UserSessionCompletionAPIData` via its existing `super().to_payload()` call.
- `datagen/conversation_replay_datagen.py`: set
  `min_tokens=bp.turn_output_lens[turn_idx]` parallel to `max_tokens` on
  every per-turn `_ConversationReplayAPIData`.
@@ -39,14 +46,17 @@ async def to_request_body(
) -> RequestBody:
if self.max_tokens == 0:

style nit: colocate the min_tokens and max_tokens checks?

@Bslabe123

Is this change limited to completions or should there also be a similar change in chat-completions?


3 participants